Google

Google essentially started in 1996 when Larry Page and Sergey Brin started work on”BackRub” and quickly morphed into the Google we know today. The premise of their search engine has been a mythological technique known as “Page Rank”. Which Google briefly discusses:

PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page’s value. In essence, Google interprets a link from page A to page B as a vote, by page A, for page B. But, Google looks at more than the sheer volume of votes, or links a page receives; it also analyzes the page that casts the vote. Votes cast by pages that are themselves “important” weigh more heavily and help to make other pages “important.”

The Original Internet
Now this was designed for a very different Internet. The majority of sites were done by businesses, and educational institutions. Free information was everywhere, subscription services where somewhat rare, rich media virtually non-existent, blogs, personal sites, parody websites virtually non-existent (at least compared to today). The people mainly could give input by going to Usenet. There was a separation of the real, and the unreal on the Internet. There were also mailing lists, and guest books. The occasional web based forum started to POP up towards the later years.

The Internet Rebirth

Free Web Hosts become extremely prevalent during 1998, 1999 and continued through this new age (though some of the less business savvy hosts died). This provided an easy outlet for virtually anyone to create a web presence. Prior you either had to be a large company or be at an academic institution.

Around late 2000, early 2001 there was a rebirth on the Internet. It wasn’t quite as dramatic as the dot com bubble that recently burst, but it was rather revolutionary. It was a sign the Internet was maturing. So many things were evolving. Blogging started to become prevalent, those website like journals who could be written by anyone with little or no fact checking unlike the mammoth websites of before. They could look professional yet be little fact. They also featured comments, the ability for anyone who wanted to leave text, and links on the website typically below the main article.

Also of rising popularity (though existent before) were Internet Forums, thanks to the recent rise of Free Forum software it became possible for more than just a few larger organizations to have forums. Combined with dropping prices for bandwidth and disk space on servers, virtually anyone could have their own forum. Another web based (unlike Usenet) method which is essentially a user editable website where anyone can post comments and/or links.

The last major component of the new internet was the Wiki, while not new, this technology came to rise in 2001 when Wikipedia opened it’s doors and shocked the web. A community of average people taking care of a website. No longer confined to the corners of the web, Wikipedia as of this posting is #48 on Alexa’s list of top websites. Just about all the others are not user editable.

Corporate Changes
While the people were gaining power, the corporations that were kings of the original web also had some changes. Many decided to move some of their content to subscription models. So only paying customers can access the content. NY Times, Time Magazine, WSJ among many others made some content exclusive. The more professional content of the Internet was disappearing behind logins, while the less professional content was becoming more accessible.

Google
Google during this time had a massive shift into mainstream. Google’s indexing technique proved rather well up to this point. It found very relevant information very quickly. While many search sites were selling ranks, Google didn’t. It’s algorithm made the ranks. Google was unbiased in a very bias internet. Google could do no wrong.

Spam
Spammers soon realized how to abuse Google and other search engines. By creating bots that planted their links all over forums, blogs, and fictional web pages created for the purpose, they could essentially boost their page rank and achieve higher status on Google. This low cost method was rather effective for some time. Unlike email spam, this technique targeted search engines, no so much end users directly.

Fight back
Google and other search engines decided to fight back by adopting a new technique of preventing websites from gaining page rank by spamming. Using a HTML tag a webmaster can tell a search engine, these sites weren’t legitimately linked, and don’t deserve the bump in page rank.

Tomorrow
Now with a really concise rundown of the history it’s possible to look at the issues that lie ahead for techniques such as “Page Rank”. How does it compensate for the massive shift in content? It’s a design that was based on a much more honest Internet of organizations that linked to relevant content, not because of boosting page rank. It was an Internet of the privileged who researched, rather than the million+1 blogs of useless text. Now in the advent of rich media (audio/video), how does one even begin to analyze the data?

Does Bayesian filtering play a role? Is it possible to use this spam technique not only to fight off page rank abusers, but to rank based on legitimacy of data?

How does one define legitimacy in a mathematical means (the only language a search engine truly knows). And who defines it. In an age of media corruption as well as infinite bias sources, who defines what’s real or not? How does one prevent a particular ideology from gaining the upper hand in search results? Do a quick Google search for Jew to see how page rank can be abused. #2 on that list as of this posting is what most consider a hate site “Jew Watch” as it calls itself. Is it accurate information? Is it the best of the web? This page, gets a Google Rank of about #189. Is this accurate either? Is it a bias source? Are the 188 before it better in terms of research? I noticed several parody sites, a few that may be considered hate sites, several blogs, mailing list archives, and other user-editable pages. But what did this very basic search give me? I can of course pick a source and do a query, for example Time. This is slightly improved, but I only have access to some of their content. At least an editor has looked at the contents before it was published. Time has decades worth of reputation. But there is still a problem. I lost the key advantage of the Internet. I’m no longer searching on the interoperable network of computers that is the Internet. I’m searching Time Magazine. I’m not getting the most relevant thing of many sources, but the best that Time has to offer.

I choose this query as an example because it was somewhat recently (last year or so) in the news because of #2. But it’s far from the only query that works. Anything regarding politics, social issues, religious ideology, is all tainted data.

Is the Internet broken?
I wouldn’t say it’s broken, but I’d suggest it’s time we go back to our roots. Way back when each academic discipline published academic journals. Works published by academics, which were peer reviewed. In fact they are still done, just few every look at, or notice them. This concept is very relevant on the Internet and can be applied many ways.

Peer Reviewed Internet
This would simply be the option to rate sites you visit when you search, and have that data funneled back into the search engine for future use. Sites people find relevant get a higher rank, sites that stink, get punished. This obviously has some risk of abuse from things like bots, or simply a bias world (just because people disagree with something, such as another political standpoint, doesn’t mean it’s a bad source).

Peer Reviewed Search
This in my mind would be the most promising technique. Putting together a peer organization to rank and approve websites based on their integrity, and allow queries on approved websites. This puts organizations with established standards of journalism above that of blogs and allows web users to do queries based on those. The key to this would be to establish a standard benchmark (such as how information is collected, presented, ownership of information). This would be similar to my example of a query against time.com, but open to all sites that are professional organizations with peer approval.

Blog Search
Technorati is currently my favorite for this. I really wish they would partner with Google as they have very complementary technology. It would be nice to do a Google query of the blogosphere and see how people react to certain things. A query of the blogosphere on “cloning” for example. Perhaps even an algorithm that based on links and text can even give me a summary of the content (such as Weighted Categories says a lot about the Blog itself). A nutshell overview of the blogs ideology, slant, and focus. Based on content, not some meta tag (keywords, description) which can be forged. Nobody is “fair and balanced”, that’s a marketing term not a business practice.

This isn’t a new concept, Google Groups has done this for years already, just Usenet is not mainstream anymore. We’re querying history rather than the present.

Conclusion
In a world of exponential data growth, and a user editable Internet, it’s not possible to give relevant search based on search techniques designed for the Internet of several years ago. The quality of search has dropped significantly, and will likely continue. What is necessary is to break down the information based on its source, and its quality, and allow users to easily search groups based on this. Ideally a search engine would be smart enough to automatically distinguish reputable content and less reputable content. Ideally a search engine would know enough to realize what authors are well respected, and who isn’t. Ideally a search engine could decide the best approach to searching, and use that technique, rather than use 1 technique to index and search.

The search engine wars are on. It will be interesting to revisit this post and see how things have changed in time.

Google, MSN, and Yahoo… plus a ton of blog developers sat down and came up with a fix. And there talking about rapid rollout on this one. Google Blog has the details.

Basically you need to have your blogging product of choice ad

<a href="URL" rel="nofollow">LINK</a>

to any link a visitor can add themselves (trackback, comments, etc). That will tell the search engines not to boost their rank based upon the linking. As a result spamming weblogs will serve no purpose. There will no longer be a page rank increase.

I’ve already hacked WordPress to cover part of this. It won’t do within comment fields, but will do so when you enter a website into the URL field when filing a comment.

Sorry spammers, the world decided: GO AWAY. We don’t like you, never have, never will. Your a bunch of “businesses” with unethical business plans (I have business in quotes since most aren’t even businesses, they are just people trying to scam someone out of some cash).

Thanks to:

Google, Yahoo, MSN, LiveJournal, Scripting News, Six Apart (MovableType), Blogger, WordPress, Flickr, Buzznet, blojsom, Blosxom .

It’s good to see widespread coordination.

Now what about email spam? When will they come up with a DomainKeys, SPF, solution.