A Look At Simple Update Protocol (SUP)

The increasingly popular FriendFeed is proposing a new protocol known as Simple Update Protocol (SUP). The problem FriendFeed is encountering is nothing new: they monitor RSS feeds across a variety of services for each user, and that can really add up. To keep things timely they poll those feeds frequently. Generally speaking this is a very wasteful process, since the majority of those feeds likely didn’t change between polls. That’s wasted resources. SUP, in a nutshell, is a changelog for feeds, so that a service like FriendFeed can re-fetch only the ones that changed. This allows for quicker updates with less polling. Here’s my analysis of the proposal.

Gripes:

  • I’m not sure I agree with the author’s decision to use JSON as the format. Considering this will be used mostly (if not only) to keep tabs on XML documents (mostly RSS and Atom), it seems more correct to use XML. Presumably the reason for JSON is that it’s computationally easier to parse; the NYTimes made a similar choice with DBSlayer, a database abstraction layer that speaks JSON rather than a binary protocol to avoid the need for a client library. JSON also tends to be lighter, since you don’t have opening and closing tags surrounding all your data. Still, why introduce JSON into an XML world?
  • SUP needs index files. Google made a great move with its Sitemaps protocol by allowing for Sitemap index files: essentially a bootstrap document that lists several Sitemaps. Google further restricted each Sitemap to no more than 50,000 items and no larger than 10MB. For a very popular site such as Facebook, a single SUP file could become painful to parse. Having a SUP index would be much better. I’d essentially copy Google’s design and rules regarding size; I think they work rather well.
  • SUP should allow for using either SUP-IDs or RSS URLs. In some cases SUP would be more useful as an index of all of a site’s RSS feeds. Having that option in the protocol would make sense and future-proof it. I’m sure Google wouldn’t mind it.
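For reference, here’s roughly the shape a SUP document takes as I read the draft. The field names follow my understanding of the proposal; the SUP-IDs and timestamps are invented for illustration. Note that a consumer parses this one document to learn which feeds changed:

```python
import json

# A hypothetical SUP document, modeled on the draft's JSON layout.
# The IDs and times below are made up, not taken from any real feed.
sup_doc = """
{
  "period": 300,
  "since_time": "2008-08-23T00:00:00Z",
  "updated_time": "2008-08-23T00:05:00Z",
  "updates": [
    ["a4f35d8", "1219449600"],
    ["09cb2a6", "1219449632"]
  ]
}
"""

doc = json.loads(sup_doc)
# Each entry pairs a SUP-ID with an opaque update token.
changed = {sup_id for sup_id, _token in doc["updates"]}
print(sorted(changed))  # → ['09cb2a6', 'a4f35d8']
```

You can see why JSON parsing is trivial here; an equivalent XML document would carry the same information with a bit more markup.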

Likes:

  • It’s plain HTTP, unlike XMPP, which can be tough for a startup to implement (look at Twitter). HTTP is “native” and obvious. It also allows for things like gzip encoding to cut down on bandwidth, plus all the other nice things HTTP gives you. It’s also firewall friendly, something that raw-TCP solutions often aren’t. XMPP can also be very resource intensive; a supplemental feed really isn’t.
  • It would cut down on unnecessary feed polling. As FriendFeed notes, consumers would still need to poll, but the per-feed interval could be significantly longer, resulting in far fewer requests.
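Because SUP rides on HTTP, a consumer gets compression and conditional requests for free. A minimal sketch (the URL is a placeholder, and the request is only constructed here, not sent):

```python
import urllib.request

# Build a poll request for a hypothetical SUP document. The server can
# gzip the response and answer 304 Not Modified when nothing changed.
req = urllib.request.Request(
    "http://example.com/sup.json",
    headers={
        "Accept-Encoding": "gzip",  # ask the server to compress the changelog
        "If-Modified-Since": "Sat, 23 Aug 2008 00:00:00 GMT",  # skip the body if stale
    },
)
# urllib normalizes header names via str.capitalize()
print(req.get_header("Accept-encoding"))  # → gzip
```

Neither of these niceties is available with a bespoke TCP protocol without reinventing them.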

It’s important to note the difference between the computational resources in play: “requests”, “bandwidth”, and “CPU”. A lot of people commenting on the proposal seem to have confused them. This proposal would reduce all three, provided enough consumers supported SUP.

Every time a consumer reads an RSS feed on a dynamic site (assuming no cache exists), the database is hit to get the latest items. Even if an If-Modified-Since header is sent with the request, the site still needs to check whether there’s anything newer than that date. For this reason If-Modified-Since, while conserving bandwidth, doesn’t do much to reduce the number of requests or the CPU required. SUP works around this by hitting one feed first, then retrieving only the feeds known to have updated. Clearly If-Modified-Since and SUP aren’t doing the same thing; they are, however, somewhat complementary.
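The consumer side of that workaround is simple set logic. A sketch, with invented SUP-IDs and feed URLs: given the feeds a service tracks and one SUP poll, only the intersecting feeds get fetched.

```python
# SUP-ID → feed URL for every feed this service tracks.
# All IDs and URLs here are made up for illustration.
tracked = {
    "a4f35d8": "http://example.com/users/alice/feed",
    "09cb2a6": "http://example.com/users/bob/feed",
    "77e01bc": "http://example.com/users/carol/feed",
}

def feeds_to_fetch(sup_updates, tracked):
    """Return URLs of tracked feeds that the SUP document says changed."""
    changed_ids = {sup_id for sup_id, _token in sup_updates}
    return [url for sup_id, url in tracked.items() if sup_id in changed_ids]

# One poll of the SUP document reports two changed feeds;
# bob's feed is never requested at all.
updates = [["a4f35d8", "1219449600"], ["77e01bc", "1219449700"]]
print(feeds_to_fetch(updates, tracked))
```

Scale `tracked` up to millions of feeds and the savings over blind polling become obvious.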

To generate this SUP feed, one must essentially do one of two things:

  • On a “cron” or some other interval-based schedule, query the database for updates and generate the feed. That means computing updates for feeds nobody may ever request (think of all those accounts you created on a website and never visited again). Sharded databases add further complexity. This can get pretty ugly.
  • When a change takes place, for example when a user adds a photo on Flickr, insert the SUP-ID of the changed feed into a dedicated table. Then, on an interval, generate the feed from that one table, and periodically flush records that are no longer needed. Generating the feed off that table is substantially easier. Of course one can argue whether that’s “correct” from a data-modeling point of view. It’s not really normalized, but then again, how many production databases really are, at least when performance matters?
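The second approach can be sketched in a few lines. The table and column names here are my own invention, not part of the proposal; SQLite stands in for whatever database the site runs:

```python
import json
import sqlite3
import time

# One dedicated table holds the SUP-IDs of recently changed feeds.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sup_updates (sup_id TEXT, updated_at REAL)")

def record_change(sup_id):
    """Called from the write path, e.g. when a user adds a photo."""
    db.execute("INSERT INTO sup_updates VALUES (?, ?)", (sup_id, time.time()))

def generate_sup(period=300):
    """Build the SUP document for the last `period` seconds from the one table."""
    cutoff = time.time() - period
    rows = db.execute(
        "SELECT sup_id, updated_at FROM sup_updates WHERE updated_at >= ?",
        (cutoff,),
    ).fetchall()
    # Flush rows too old to appear in any future period.
    db.execute("DELETE FROM sup_updates WHERE updated_at < ?", (cutoff,))
    return json.dumps({
        "period": period,
        "updates": [[sup_id, str(int(ts))] for sup_id, ts in rows],
    })

record_change("a4f35d8")
print(generate_sup())
```

Generating the feed never touches the main content tables, which is exactly why this beats the cron-over-everything approach.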

This isn’t exactly a brand-new idea. Six Apart has an Atom-based update service, Twitter uses XMPP (though it looks like they may phase that out), and LiveJournal at one point had a TCP-based system that did essentially the same thing.

Overall SUP isn’t really a bad way of doing things. In a sense, it’s Google Sitemaps for feeds. It solves a problem that today doesn’t have a great solution.

2 thoughts on “A Look At Simple Update Protocol (SUP)”

  1. Isn’t this what HEAD requests and Last-Modified-Date headers are for?

    If the server has resource constraints, it should keep a cache of the last modified dates for frequently-requested URLs. The work needed to keep this cache would be the same as, or less than, the work needed to make a “feed of updates”, and it would require no client-side changes. Instead of a “table in a database containing the SUP-IDs of changed feeds”, have a “table in the database containing the Last-Modified-Date of feeds”, and send back Not Modified responses as appropriate.

    Gerv

  2. @Gerv: Not really. To get the Last-Modified date you still need to make an HTTP request for each and every feed. When you’re talking about a large site like Flickr or Facebook, that can mean millions of unnecessary requests. That’s enormously wasteful of resources and slow, unless you want to hammer Apache.

    By having a list of what’s been updated, they could retrieve content at a quicker interval with much fewer requests.

    Adding more caching doesn’t speed things up, it just creates more data latency. The goal is to get fresh data as quickly and efficiently as possible. This proposal accomplishes that. Using Last-Modified-Date still means making millions of unnecessary calls and a lot of wasted time.
