Well, I did some work on it today. It’s now in extension form (the old version, prior to Ben Goodger’s changes). It’s also using a “database” (really just an array) of 18 keywords right now, with a fair amount of success.
Now the big topic will be creating an RDF schema and a method for scanning efficiently and “fuzzily”. Allow me to expand:
We can’t just ban a page because of the word “ass”, but “ass” combined with several other words could indicate a page worth blocking. So what needs to be done is to attach point values to all words (scientifically). Then, if the total score climbs above 5.0, we block the page. This is basically how SpamAssassin operates. So what I need is for someone to do some experimentation and find out exactly which keywords to use and what point values to attach to them. A nice thing would be a little C++ app that could generate scores from sample data. I’m rather open to suggestions on how to do this. So… give suggestions, code solutions. Submit them to me, be a hero.
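To make the idea concrete, here’s a minimal sketch of the scoring approach in JavaScript. The keyword list and point values below are invented for illustration (real values would come from the experimentation described above); only the 5.0 threshold is from the post.

```javascript
// Hypothetical keyword "database" with made-up scores; the real values
// would come from experimentation, SpamAssassin-style.
var keywordScores = {
  "ass": 1.5,
  "xxx": 2.0,
  "porn": 3.5
};

// Sum the scores of every keyword that appears in the page text.
function scorePage(text) {
  var lower = text.toLowerCase();
  var total = 0;
  for (var word in keywordScores) {
    if (lower.indexOf(word) !== -1) {
      total += keywordScores[word];
    }
  }
  return total;
}

// Block only once the cumulative score crosses the 5.0 threshold,
// so a single loaded word isn't enough on its own.
function shouldBlock(text) {
  return scorePage(text) >= 5.0;
}
```

Note how a page that merely contains “ass” (say, inside “class”) scores 1.5 and stays well under the threshold, which is exactly the point of scoring instead of outright banning.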
The RDF schema also needs to contain a method field. Since regex matching is relatively slow and heavyweight, we obviously don’t want to use it more often than we have to. So the schema gives us the option to use window.find() instead. That method gives a speed increase (with obvious limitations, since it can only do literal matches).
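One way the method field could work is sketched below. The entry shape and field names are my invention, not the actual schema; and in the browser the cheap path would call window.find(), but here a plain indexOf stands in for it so the sketch runs anywhere.

```javascript
// Hypothetical keyword entries as they might come out of the RDF
// datasource. "find" entries use a cheap literal substring scan
// (window.find() in the browser; indexOf stands in here), while
// "regex" entries pay the cost of a real pattern match.
var entries = [
  { method: "find",  term: "viagra",             score: 2.5 },
  { method: "regex", term: /\bhot\s+teens?\b/i,  score: 4.0 }
];

// Dispatch on the method field so regex is only used where needed.
function matches(entry, text) {
  if (entry.method === "regex") {
    return entry.term.test(text);
  }
  // Fast path: case-insensitive literal substring search.
  return text.toLowerCase().indexOf(entry.term) !== -1;
}

function scoreWithMethods(text) {
  var total = 0;
  for (var i = 0; i < entries.length; i++) {
    if (matches(entries[i], text)) {
      total += entries[i].score;
    }
  }
  return total;
}
```

The design choice is simply to keep regex as the exception rather than the rule: most keywords are plain literals, so most of the scan stays on the fast path.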
Perhaps in the future, moving the core engine to a compiled binary would be better, but for now we make do with JavaScript. So far, performance on a 1.8GHz system is actually not much slower at all; I really don’t notice it. But we will need some more keywords. I figure about 50-100, provided we use a scoring system like the one mentioned above.
So code is coming, hopefully an initial check-in soon; I’m just not ready yet, and busy. I’ve had about 3hrs of free time today to play, and that was my break from the academic books. More to come, but let’s get the creative juices flowing.