Categories
Google Internet

Who Indexes Tweets

I was curious who is indexing the links that people tweet on Twitter. It’s obvious someone does, since links get ‘clicks’ almost immediately after submission. To do this, presumably they are tapping into the XMPP firehose.

Let’s take a look:

66.xxx.xxx.xxx - - [06/Dec/2009:20:17:43 +0000] "GET /test HTTP/1.1" 301 20 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

I guess Google has a deal with Twitter. Googlebot fetched the link just a few seconds after the tweet was sent. As far as I know nothing has actually been announced. This is the first evidence I know of for a potential deal of some sort. I’d be shocked if Google is scraping the site this quickly.

Edit: Stephen Duncan pointed out in the comments that this was announced in October. Totally forgot about that.

208.xxx.xxx.xxx - - [06/Dec/2009:20:17:47 +0000] "GET /test HTTP/1.0" 301 - "-" "Mozilla/5.0 (compatible; Butterfly/1.0; +http://labs.topsy.com/butterfly.html) Gecko/2009032608 Firefox/3.0.8"

This is Topsy, a Twitter search engine. I had never seen this site before. After a few tests I actually kind of like the output.

89.xxx.xxx.xxx - - [06/Dec/2009:20:17:58 +0000] "GET /test HTTP/1.1" 301 - "-" "Mozilla/5.0 (compatible; MSIE 6.0b; Windows NT 5.0) Gecko/2009011913 Firefox/3.0.6 TweetmemeBot"

Tweetmeme mines Twitter links and attempts to build a Digg-like index based on retweets rather than Diggs.

75.xxx.xxx.xxx - - [06/Dec/2009:20:18:05 +0000] "GET /test HTTP/1.1" 301 - "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"
72.xxx.xxx.xxx - - [06/Dec/2009:20:20:25 +0000] "GET /test HTTP/1.1" 301 - "-" "Python-urllib/2.5"

Can’t identify these AWS hosted services.

70.xxx.xxx.xxx - - [06/Dec/2009:20:20:53 +0000] "GET /test HTTP/1.1" 301 20 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"
70.xxx.xxx.xxx - - [06/Dec/2009:20:24:23 +0000] "GET /test HTTP/1.1" 301 20 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"

This is actually Microsoft. Microsoft’s Bing search engine indexes Twitter. I’m not sure why they fetched the link twice at such a close interval. That seems odd for this day and age.

Mining the logs a little deeper, it looks like other bots spider tweets once they meet certain criteria (such as being retweeted). It also looks like other search engines may be indexing at a slower rate (Baidu, for example).

There are several others from AWS and a few other dedicated providers. These servers are obviously trying to keep a low profile. They don’t even have reverse DNS.
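Checking for reverse DNS is easy to script. Here is a minimal sketch of the kind of lookup I mean, using Python's standard `socket.gethostbyaddr`. The `lookup` parameter is my own addition so the function can be exercised without touching the network.

```python
import socket

def reverse_dns(ip, lookup=socket.gethostbyaddr):
    """Return the PTR hostname for ip, or None if there is no reverse DNS."""
    try:
        # gethostbyaddr returns (hostname, aliases, addresses)
        hostname, _aliases, _addresses = lookup(ip)
    except (socket.herror, socket.gaierror):
        # No PTR record, or the resolver could not answer
        return None
    return hostname
```

Calling `reverse_dns("8.8.8.8")` on a machine with working DNS should return a hostname, while the low-profile crawlers in my logs would come back as `None`.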

So there you go. Just a matter of seconds after a link hits Twitter this all happens.

Here are a few more from another tweet that weren’t in the first set:

Edit: More!:

75.xxx.xxx.xxx - - [06/Dec/2009:20:49:42 +0000] "GET /test HTTP/1.1" 301 - "-" "Mozilla/5.0 (compatible; Feedtrace-bot/0.2; bot@feedtrace.com)"

Feedtrace is some sort of Twitter mining service currently in beta.

67.xxx.xxx.xxx - - [06/Dec/2009:20:49:45 +0000] "GET /test HTTP/1.0" 301 - "-" "Mozilla/5.0 (compatible; mxbot/1.0; +http://www.chainn.com/mxbot.html)"

Chainn is a social data mining service with a few apps that make use of the data it collects.
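For anyone who wants to repeat this on their own logs, here is a rough sketch of how the lines above could be picked apart. It assumes the "combined" access log format shown in this post. The `KNOWN_BOTS` table is only built from the user agents quoted here, and the function names are my own.

```python
import re
from collections import Counter

# One "combined" format access log line, as in the examples in this post.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Substrings taken from the user agents quoted above.
KNOWN_BOTS = {
    "Googlebot": "Google",
    "Butterfly": "Topsy",
    "TweetmemeBot": "Tweetmeme",
    "Feedtrace-bot": "Feedtrace",
    "mxbot": "Chainn",
}

def identify(line):
    """Return (ip, timestamp, label) for a log line, or None if it does not parse."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    agent = match.group("agent")
    label = "unidentified"
    for needle, name in KNOWN_BOTS.items():
        if needle in agent:
            label = name
            break
    return match.group("ip"), match.group("time"), label

def summarize(lines):
    """Count how many parsed requests came from each labelled crawler."""
    parsed = (identify(line) for line in lines)
    return Counter(entry[2] for entry in parsed if entry is not None)
```

Generic user agents like the MSIE 7.0 one end up as "unidentified", which is exactly the problem I ran into with the AWS hosts: the user agent alone won't tell you who they are.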

Categories
Google Networking

Google DNS Privacy Policy

John Gruber, among others, notes that Google’s DNS service is not tied to Google Accounts. That’s not just wording in their privacy statement. It’s technically impossible for them to do otherwise, at least with reasonable accuracy.

Your computer is associated with a Google account via a cookie given to you when you log in. Cookies are sent back to Google’s servers as HTTP headers whenever you fetch something from the host that set the cookie (every request, even images). They are only sent to that domain and nobody else.

DNS doesn’t operate over HTTP, and therefore can’t tell what Google Account you’re using.
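To make that concrete, here is a small sketch that builds a plain DNS query by hand, following the RFC 1035 wire format. The function name and the default query ID are just my choices for illustration. The point is what the packet contains: a 12-byte header and the name being asked about. There is nowhere to put a cookie.

```python
import struct

def build_dns_query(hostname, query_id=0x1234):
    """Build a minimal DNS query for an A record (RFC 1035 wire format)."""
    # Header: ID, flags (only "recursion desired" set),
    # then QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: length-prefixed labels, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A record), QCLASS=1 (Internet)
    question = qname + struct.pack("!HH", 1, 1)
    return header + question
```

A query like this would normally be sent over UDP to port 53 of the resolver. Everything the server learns comes from those bytes plus the source IP address of the packet.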

Google could, however, take the IP address you used to log in to your Google Account and associate it with your DNS activity, but that would make the statisticians at Google cringe. So many homes and businesses have multiple computers behind a NAT router, and Google DNS is unable to distinguish between them. Even one computer can have multiple users.

Before someone jumps up and says “MAC address”, the answer is: NO. To keep it simple, a MAC address is part of the “Data Link Layer” of the OSI model (Layer 2) and is used to address adjacent devices. Your MAC address is only transmitted as far as the first hop, which would be the first router on your way to Google. Each time your data makes it to the next device on its way to Google, the previous MAC header is stripped off and a new one is added. By the time your bits get to Google, that packet of data has only the last hop’s MAC address on it. Many people confuse Layers 2 and 3.
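If that’s hard to picture, here is a toy model of it. This is only a sketch: the `Frame` class, the MAC addresses, and the IP addresses are all made up for illustration, and real routers do a lot more. It just shows that the Layer 2 addresses get rewritten at every hop while the Layer 3 addresses ride along untouched.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Frame:
    src_mac: str    # Layer 2: rewritten at every hop
    dst_mac: str
    src_ip: str     # Layer 3: carried end to end (ignoring NAT)
    dst_ip: str
    payload: bytes

def forward(frame, router_mac, next_hop_mac):
    """Model a router: drop the incoming Layer 2 header and add a new one."""
    return Frame(
        src_mac=router_mac,
        dst_mac=next_hop_mac,
        src_ip=frame.src_ip,
        dst_ip=frame.dst_ip,
        payload=frame.payload,
    )

def send(frame, hops):
    """Pass a frame along a path given as (router_mac, next_hop_mac) pairs."""
    for router_mac, next_hop_mac in hops:
        frame = forward(frame, router_mac, next_hop_mac)
    return frame
```

Whatever arrives at the far end carries the MAC address of the last router, not of the machine that originally sent the data.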

Categories
Google Mozilla

Google Goes HTML5

I just noticed that Google is now serving its homepage with an HTML5 doctype:

<!doctype html>

I suspect this might have changed when they launched that new fade effect. I also noticed they are doing so when using the new YouTube “Feather” beta. This shouldn’t be too surprising considering their involvement in the HTML5 specs and developing a web browser and announcing it’s moving away from Google Gears.

Of course the pages don’t validate, and don’t really take advantage of many HTML5 features (that I’ve seen at least). But it’s a step in the right direction. With modern browsers like Firefox, Chrome, and Safari becoming more popular, it’s slowly becoming a reality.
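If you want to check other sites for the same thing, something like the following would do as a first pass. It’s a minimal sketch that only looks for the short form of the doctype at the start of the markup. The function name is mine, and it makes no attempt to validate the rest of the page.

```python
import re

# The short HTML5 doctype is case-insensitive and may follow leading whitespace.
HTML5_DOCTYPE = re.compile(r"\s*<!doctype\s+html\s*>", re.IGNORECASE)

def has_html5_doctype(markup):
    """Return True if the markup starts with the short HTML5 doctype."""
    return HTML5_DOCTYPE.match(markup) is not None
```

Older doctypes such as the HTML 4.01 one carry a public identifier after the word `HTML`, so they don’t match.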

Categories
Google Networking

Google Public DNS Analysis

Google’s new Public DNS is interesting. They want to lower DNS latency in hopes of speeding up the web.

Awesome IP Address

This is the most interesting thing to me. I view IP addresses similar to the way Steve Wozniak views phone numbers, though I don’t collect them like he does phone numbers.

[Querying whois.arin.net]
[whois.arin.net]
Level 3 Communications, Inc. LVLT-ORG-8-8 (NET-8-0-0-0-1) 
                                  8.0.0.0 - 8.255.255.255
Google Incorporated LVLT-GOOGL-1-8-8-4 (NET-8-8-4-0-1) 
                                  8.8.4.0 - 8.8.4.255

# ARIN WHOIS database, last updated 2009-12-02 20:00
# Enter ? for additional hints on searching ARIN's WHOIS database.

Looks like Google is working with Level 3 (also their partner for Google Voice, I hear) for the purpose of having an easy-to-remember IP. From what I can tell it’s anycasted to a Google data center.
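The two ranges in that WHOIS output are easier to reason about in CIDR form. Here is a quick sketch using Python’s standard `ipaddress` module. The helper name is my own, and 8.8.4.4 is used because it is one of the Google Public DNS addresses.

```python
import ipaddress

def range_to_networks(first, last):
    """Convert a WHOIS-style 'first - last' range into CIDR networks."""
    return list(ipaddress.summarize_address_range(
        ipaddress.ip_address(first), ipaddress.ip_address(last)))

# The two ranges from the WHOIS output above
level3 = range_to_networks("8.0.0.0", "8.255.255.255")
google = range_to_networks("8.8.4.0", "8.8.4.255")

# The Google block sits inside the Level 3 block
nested = google[0].subnet_of(level3[0])

# One of the Google Public DNS addresses falls in the Google block
in_google_block = ipaddress.ip_address("8.8.4.4") in google[0]
```

Each range collapses to a single network here, a /8 for Level 3 and a /24 carved out of it for Google.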

For what it’s worth, 6.6.6.6 is owned by the US Army. Make of that what you will.

NXDOMAIN

My first thought was that Google would hijack NXDOMAIN to show ads, like many ISPs and third-party DNS providers do. Instead they explicitly state:

If you issue a query for a domain name that does not exist, Google Public DNS always returns an NXDOMAIN record, as per the DNS protocol standards. The browser should show this response as a DNS error. If, instead, you receive any response other than an error message (for example, you are redirected to another page), this could be the result of the following:

  • A client-side application such as a browser plug-in is displaying an alternate page for a non-existent domain.
  • Some ISPs may intercept and replace all NXDOMAIN responses with responses that lead to their own servers. If you are concerned that your ISP is intercepting Google Public DNS requests or responses, you should contact your ISP.

Good. Nobody should ever hijack NXDOMAIN. DNS should be handled per spec.
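For the curious, "per spec" here means the response code in the DNS header. RFC 1035 puts it in the low four bits of the flags field, and 3 means the name does not exist. Here is a rough sketch of how a raw response to a made-up name could be checked. The function names and the hijack heuristic are my own, not anything Google documents.

```python
import struct

NXDOMAIN = 3  # RCODE "Name Error" in RFC 1035

def response_code(message):
    """Return the RCODE from the 12-byte header of a raw DNS response."""
    if len(message) < 12:
        raise ValueError("DNS message is shorter than its 12-byte header")
    _query_id, flags = struct.unpack("!HH", message[:4])
    return flags & 0x000F

def looks_hijacked(message):
    """Heuristic for a response to a name known not to exist.

    A resolver that rewrites NXDOMAIN answers "no error" and includes
    at least one answer record pointing somewhere of its choosing.
    """
    _query_id, flags, _qdcount, ancount = struct.unpack("!HHHH", message[:8])
    return (flags & 0x000F) == 0 and ancount > 0
```

This only makes sense for a name you are sure does not exist. For a real domain, "no error" with answers is of course the normal case.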

Performance Benefits

Google documented what they did to speed things up. Some of it anyway. The good news is that they will still obey TTLs, it seems. My paraphrasing:

  • Infrastructure – Tons of hardware/network capacity. No shocker.
  • Shared caching in the cluster – Pretty self explanatory.
  • Prefetching name resolutions – Google is using their web search index and DNS server logs to figure out which names to prefetch.
  • Anycast routing – Again obvious. They do note however that this can have negative consequences:

    Note, however, that because nameservers geolocate according to the resolver’s IP address rather than the user’s, Google Public DNS has the same limitations as other open DNS services: that is, the server to which a user is referred might be farther away than one to which a local DNS provider would have referred. This could cause a slower browsing experience for certain sites.
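To show what obeying TTL means for a cache, here is a minimal sketch. To be clear, this is not Google’s implementation, which they only describe at a high level. The class and the injectable `clock` are my own illustration of the idea that a cached answer is dropped once the TTL set by the domain owner runs out.

```python
import time

class TTLCache:
    """Cache DNS answers and drop each one when its TTL runs out."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # name -> (expiry time, address)

    def put(self, name, address, ttl):
        """Store an answer that stays valid for ttl seconds."""
        self._entries[name] = (self._clock() + ttl, address)

    def get(self, name):
        """Return the cached address, or None if missing or expired."""
        entry = self._entries.get(name)
        if entry is None:
            return None
        expires_at, address = entry
        if self._clock() >= expires_at:
            # TTL has run out, so the answer must be fetched again
            del self._entries[name]
            return None
        return address
```

Prefetching would sit on top of something like this, refreshing popular names shortly before they expire so a user never waits on the lookup.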

Google also discusses the security practices to mitigate some common security issues.

Privacy

Google says in their privacy policy that they erase any IP information after 24-48 hours. Assuming you trust Google, that may be better than what your ISP is doing, though your ISP could still log your queries by monitoring DNS traffic over their network. As far as I’m aware there are no US laws governing data retention, though some have been proposed several times.

I am curious how this will be treated in Europe, which does have some data retention laws for ISPs. Does providing DNS, traditionally an ISP activity, make you an ISP? Or do you need to handle transit as well? Does an ISP need to track the DNS queries of someone using a third-party DNS? Remember that recording IPs alone is not the same thing, thanks to virtual hosting: many websites can exist on one IP.

OpenDNS and others may have flown under the radar being smaller companies, but Google will attract more attention. I suspect it’s only a matter of time before someone raises this question.

Would I use it?

I haven’t seen any DNS-related problems personally. I’ve seen degraded routing from time to time from my ISP. Especially in those cases, my nearby ISP-provided DNS would be quicker than Google. I don’t really like how nameservers may geolocate me further away, but that’s not a deal killer. I don’t plan on switching since I don’t see much of a benefit at this time.

Categories
Around The Web Audio/Video Funny

Surprised Kitty


I’m amazed how quickly this stuff can get around the web. I’m pretty sure this is what Vint Cerf and Bob Kahn had in mind when working on TCP/IP.

Categories
Google Web Development

Google Is Moving Away From Google Gears

LA Times is reporting:

“We are excited that much of the technology in Gears, including offline support and geolocation APIs, are being incorporated into the HTML5 spec as an open standard supported across browsers, and see that as the logical next step for developers looking to include these features in their websites,” wrote a Google spokesman in an e-mail.

I complained a while back that things seemed too fragmented. To date I’ve been pretty leery of things because I wouldn’t want to support two competing methods or require users to either download Google Gears or use a browser that supports cutting edge technologies. It’s either too much effort and code footprint, or too much effort from the user perspective to download another binary.

A few Google folks replied to my earlier blog post and noted that they fully intended to work towards convergence. I’m glad it’s finally becoming a reality. I hope Google Gears will continue to be developed to fill in missing functionality for browsers that tend to fall behind, and will simply let the browser take over if and when it eventually supports that functionality. That would create a consistent environment across platforms and browsers.