robots.txt

There’s a fair amount of controversy regarding Phorm, a company that plans to target advertising by harvesting information via deep packet inspection. They are already in talks with several ISPs. I’ll leave the debate over Phorm from a user perspective for someplace else.

They claim to offer ways to let websites opt out of their tracking, but it’s a true double-edged sword, as they don’t play nice with a standard robots.txt file. Take a look at what they are doing here:

The Webwise system observes the rules that a website sets for the Googlebot, Slurp (Yahoo! agent) and “*” (any robot) user agents. Where a website’s robots.txt file disallows any of these user agents, Webwise will not profile the relevant URL. As an example, the following robots.txt text will prevent profiling of all pages on a site:

Rather than use a unique user agent of their own, they piggyback on the rules written for Google’s and Yahoo’s crawlers. Short of disallowing “*”, which shuts out every robot, the only way to block them via a robots.txt file is to tell one of the two largest search engines in the western world not to index your site. This seems fundamentally wrong.
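To make the trade-off concrete, here is a rough sketch using Python’s standard urllib.robotparser. The function webwise_would_profile is my own reading of Phorm’s statement quoted above (no profiling if Googlebot, Slurp or “*” is disallowed), not their actual code, and the “Webwise” user-agent token in the last sample is purely hypothetical, since they don’t appear to honor one.

```python
from urllib.robotparser import RobotFileParser

# User agents whose rules Webwise says it observes (from Phorm's statement).
OBSERVED_AGENTS = ("Googlebot", "Slurp", "*")


def crawl_permissions(robots_txt, url="/"):
    """Return {agent: allowed?} for each observed agent under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in OBSERVED_AGENTS}


def webwise_would_profile(robots_txt, url="/"):
    """Assumed rule: profile only if none of the observed agents is disallowed."""
    return all(crawl_permissions(robots_txt, url).values())


# Blocks Yahoo's crawler only; Google can still index the site.
BLOCK_SLURP = "User-agent: Slurp\nDisallow: /\n"

# Blocks every robot, search engines included.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"

# Hypothetical dedicated token; nothing in Phorm's statement says it is honored.
BLOCK_HYPOTHETICAL_WEBWISE = "User-agent: Webwise\nDisallow: /\n"

if __name__ == "__main__":
    samples = {
        "block Slurp": BLOCK_SLURP,
        "block everything": BLOCK_ALL,
        "block 'Webwise'": BLOCK_HYPOTHETICAL_WEBWISE,
    }
    for label, text in samples.items():
        print(label, crawl_permissions(text), webwise_would_profile(text))
```

Under that reading, every robots.txt that keeps Webwise out also keeps at least one real search engine out, while a rule aimed at Webwise by name does nothing at all.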

There is an email address where you can provide a list of domains to exclude, but that requires manual intervention and updating the list every time you create a site. This obviously doesn’t scale.

Now I’m curious. Is piggybacking off of another company’s user agent considered a trademark violation? From what I understand they aren’t broadcasting it, just honoring it. If I were Google or Yahoo I’d be pretty annoyed. Particularly Yahoo, since there are websites that will just block Slurp given Google’s dominance in search. Yes, there are many user-agent spoofing products out there (including wget and curl), but nobody to my knowledge is crawling web pages for a commercial purpose while hiding behind another company’s name.

robots.txt is a somewhat flawed system, as not all user agents even obey it (sadly). Still, short of actually blocking requests, it’s one of the only defenses that exist.