Amazon S3 Outage

The buzz around the web today was the outage of Amazon’s S3. It shows which websites are “doing it right”, and which ones fail. This is a great follow-up to my “Reliability On The Grid” post the other day.

Amazon S3 is cloud-based storage. Essentially, when you send them a file using their REST or SOAP interface, Amazon stores it on multiple nodes in their infrastructure. This provides redundancy and security (in case a data center catches fire, for example). Because of this design it’s often thought that cloud-based computing is immune to problems. This is hardly the case. Just like any large system, it’s complicated and full of hazards. It takes only a small software glitch or an unaccounted-for issue to cause the entire thing to grind to a halt. More complexity = more things that can fail.

Amazon S3 is popular because it’s cheap and easy to scale. It’s pay-per-use based on bandwidth, disk storage, and requests. Because that allows for websites to grow without having to make a large infrastructure investment, it’s popular for “Web 2.0” companies trying to keep their budgets tight. Notably sites like Twitter, WordPress.com, SmugMug and Amazon.com themselves all use Amazon S3 to host things like images.
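To make the pay-per-use idea concrete, here is a rough sketch of how a monthly bill adds up from the three metered quantities mentioned above. The rates in `EXAMPLE_RATES` are made-up placeholders, not Amazon's actual pricing, and the `Rates` and `monthly_cost` names are just for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Hypothetical per-unit prices in USD (NOT Amazon's real rates)."""
    per_gb_stored: float
    per_gb_transferred: float
    per_1000_requests: float


def monthly_cost(rates: Rates, gb_stored: float,
                 gb_transferred: float, requests: int) -> float:
    """Sum the three usage-based charges for one month."""
    return (
        rates.per_gb_stored * gb_stored
        + rates.per_gb_transferred * gb_transferred
        + rates.per_1000_requests * requests / 1000
    )


# Placeholder numbers, chosen only to make the arithmetic easy to follow.
EXAMPLE_RATES = Rates(per_gb_stored=0.10,
                      per_gb_transferred=0.20,
                      per_1000_requests=0.01)

if __name__ == "__main__":
    # A small site: 100 GB stored, 50 GB served, 200,000 requests.
    print(monthly_cost(EXAMPLE_RATES, 100, 50, 200_000))
```

The point is that the bill starts at zero and scales with usage, which is why there is no large up-front investment.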

Many sites, notably Twitter and SmugMug, didn’t have a good day today. WordPress.com and Amazon.com operated as normal. The obvious reason for this is that WordPress.com and Amazon.com are much better in terms of infrastructure and design.

WordPress.com uses S3, but proxies it with Varnish. There’s a brief description here, and a more detailed breakdown here. According to Barry Abrahamson, WordPress.com does 1500 image requests per second, and 80-100 of those are served through S3. They have (slower) backups in house and can fail over to them if S3 has a problem. This means they can leverage S3 to their advantage, but aren’t down because of S3. Using Varnish allows them to keep the S3 bill down by using their own bandwidth (likely cheaper, since they are a large site and can get better rates on bandwidth). It also gives them a good level of redundancy. Awesome job.
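The general shape of that setup (cache first, then S3, then a slower in-house copy) can be sketched in a few lines. This is not WordPress.com's actual Varnish configuration, just a toy Python illustration of the idea; `CachingProxy`, `OriginUnavailable` and the fetcher functions are all hypothetical names.

```python
from typing import Callable, Dict, Optional, Sequence

# A fetcher takes an object key and returns its bytes, or raises
# OriginUnavailable if its backing store cannot serve the request.
Fetcher = Callable[[str], bytes]


class OriginUnavailable(Exception):
    """Raised when a backing store (e.g. S3) cannot serve a request."""


class CachingProxy:
    """Serve from a local cache, falling back through origins in order."""

    def __init__(self, origins: Sequence[Fetcher]) -> None:
        self._origins = list(origins)       # e.g. [s3, in_house_backup]
        self._cache: Dict[str, bytes] = {}  # unbounded; fine for a sketch

    def get(self, key: str) -> bytes:
        # Cache hits never touch an origin, which keeps the S3 bill down.
        if key in self._cache:
            return self._cache[key]
        last_error: Optional[Exception] = None
        for fetch in self._origins:
            try:
                body = fetch(key)
            except OriginUnavailable as exc:
                last_error = exc  # try the next (slower) origin
                continue
            self._cache[key] = body
            return body
        raise OriginUnavailable(f"no origin could serve {key!r}") from last_error
```

With S3 listed first and the in-house store second, an S3 outage only costs a slower first fetch for uncached objects; anything already cached is unaffected.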

Amazon.com uses S3 themselves. If you look at images on the site, they are actually served from g-ecx.images-amazon.com, which is actually a CNAME:

g-ecx.images-amazon.com. 38     IN      CNAME   ant.mii.instacontent.net.

instacontent.net is actually part of Mirror Image, a CDN. This is essentially outsourcing what WordPress.com is doing in terms of caching. It’s similar to Akamai’s services. A CDN’s biggest advantage is lowering latency by using servers closer to the customer, which generally makes the site feel faster. The other benefit is that it caches content for when the origin is having problems. Because Amazon has a layer on top of S3, they had an added level of protection: the site remained up and images loaded.
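The "cache content for when the origin is having problems" behavior can be sketched as a cache that prefers fresh copies but will hand out an expired one rather than fail. This is only an illustration of the idea, not how Mirror Image or any particular CDN implements it; `StaleOnErrorCache` and its parameters are hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class StaleOnErrorCache:
    """Cache with a TTL that serves expired entries if the origin fails."""

    def __init__(self, fetch_origin: Callable[[str], bytes],
                 ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch_origin = fetch_origin
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the behavior is easy to test
        self._entries: Dict[str, Tuple[float, bytes]] = {}

    def get(self, key: str) -> bytes:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh, no origin request needed
        try:
            body = self._fetch_origin(key)
        except Exception:
            if entry is not None:
                return entry[1]  # stale, but better than a dead image
            raise  # never seen this object, nothing to fall back on
        self._entries[key] = (now, body)
        return body
```

Anything that was requested at least once before the outage keeps loading; only objects the cache has never seen will fail.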

Twitter serves most images such as avatars right off of S3. This means when S3 went down, there were thousands of dead images on their pages. No caching, not even a CNAME in place. Image hosting is the least of their concerns. Keeping the service up and running is their #1 concern right now. The service was still usable, just ugly. Many users take advantage of third party clients anyway.

Using a CDN or having the infrastructure in house is obviously more expensive (it makes S3 more of a luxury than a cost-saving measure), but it means you’re not depending on one third party for your uptime.

Reliability On The Grid

There’s been a lot of discussion lately (in particular NYTimes, Data Center Knowledge) regarding both the reliability of web applications, which users are becoming more and more reliant on, and the security of such applications. It’s a pretty interesting topic considering there are so many things that ultimately have an impact on these two metrics. I call them metrics since that’s what they really are.

Drobo for network storage?

Drobo initially didn’t impress me too much, but after watching a demo I’m somewhat impressed. The positives:

  • The hotswapping, RAID-like (but not RAID) redundancy is awesome. That’s perfect for backup/bulk storage purposes.
  • Transfer isn’t bad (up to 22MB/s read, 20MB/s write).
  • Power consumption idles at about 12 watts, which isn’t bad.
  • Adding storage capacity is really easy.

There are some downsides:

  • No Linux support, which stinks if you wanted to hook it up to an old PC running Linux and use Samba. You could of course use a Mac.
  • Pretty expensive. $499 isn’t cheap for a glorified drive enclosure. You still need a host, and drives.

Of course, for true backup you need to get your data offsite, but you can do that through standard means, such as Amazon’s S3. So you’re covered there.

The downfall of this product is the lack of a 10/100 Ethernet port. It would likely have been pretty cheap (let’s face it, network devices are pretty cheap these days) and would have removed the need for a PC. You could of course hook it up to an access point such as the Airport Extreme… but you don’t get the greatest level of control with these.

Ideally, a real cheapo Linux machine (Intel Celeron, 1GB RAM, 80GB HD) with a Drobo would be an awesome backup solution. You could then use MRTG to graph network/data storage usage, manage usage and quotas, or whatever else you wanted to do. Even a media server. Back up some data with S3? No problem. You could even set up something like BackupPC to back up entire PCs.