Reliability On The Grid

There’s been a lot of discussion lately (in particular at the NYTimes and Data Center Knowledge) about both the reliability of the web applications users increasingly rely on and the security of those applications. It’s a pretty interesting topic considering how many things ultimately have an impact on these two metrics. I call them metrics since that’s what they really are.

Defining uptime, security, and privacy

For the purposes of this discussion, “uptime” is defined as the application being accessible and functional to the user. Note that putting a “fail whale” image up so that the page loads does not qualify as functional. For all intents and purposes the service is down. One should also note that traffic takes different routes to reach different users, so a site can be down for one person but up for millions of others. The vast majority of users (95%+) should be able to use the service for it to really be considered “up”.
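To make that definition concrete, here is a minimal sketch of the 95% rule in Python. The function name `is_up`, the probe names, and the idea of collecting one result per vantage point are my own assumptions for illustration; a real monitoring setup would weight probes by how many users sit behind each one.

```python
def is_up(probe_results, threshold=0.95):
    """Return True if at least `threshold` of vantage points could use the service.

    `probe_results` maps a (hypothetical) vantage point name to True when the
    application was reachable AND functional from there. A page that only
    serves a fail whale should be recorded as False.
    """
    if not probe_results:
        return False  # no evidence that anyone is being served
    working = sum(1 for ok in probe_results.values() if ok)
    return working / len(probe_results) >= threshold


# Usage: 20 made-up vantage points, one of which cannot use the site.
probes = {f"probe-{i}": True for i in range(20)}
probes["probe-0"] = False
print(is_up(probes))
```

With 19 of 20 probes working the service still counts as up. Lose one more and it no longer does, even though the page may load fine for you.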

“Security” is defined as the assurance that privacy, data integrity, and account access are restricted in accordance with typical site functionality and users’ understanding. “Privacy” is defined as not allowing any unauthorized person or entity to manipulate, view, copy, handle, destroy, or know about the existence of data without explicit approval from the user.

Why applications fail

Applications can fail for many reasons, but most can be lumped into a handful of categories. At the highest level you have in-house and upstream reasons. In-house can be defined as something you can control, for example your own software or servers, while upstream is typically a vendor or partner, for example an ISP or colo facility, over which you have less control (other than submitting a ticket). Generally startups have more upstream services and bring more things in-house as time goes on. For example, Facebook relies on colo facilities for their servers. They now plan to build their own (more control, and hopefully lower costs as well).

On a lower level you can break things down into hardware and software. Hardware failures are inevitable. Computers suck in a 24×7 environment. We deal with that since they are better than people, who still insist on sleep (lazy bastards). Hard drives fry, motherboards fail, and fans die, resulting in “thermal events”. Generally it’s pretty easy to deal with this. You can use RAID so that one hard drive isn’t critical; after all, moving parts are the most prone to failure. You can also have more than one server powering a successful application. If one dies, the load goes to other boxes running on the grid. You can put them in different data centers so that if there is a problem at one, you’re still up and running. This obviously comes at a cost. Services like Google App Engine, Amazon’s S3, and Amazon’s EC2 help lower the cost, but also result in hardware being handled by an upstream provider. Amazon and Google are very redundant, but they too can fail and have failed.
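The payoff of redundancy can be roughed out with a little arithmetic. The sketch below assumes every server fails independently of the others, which is optimistic: boxes that share a data center, a power feed, or an upstream provider tend to fail together. The function names and the 99% figure are made up for illustration.

```python
def combined_availability(per_server, servers):
    """Availability of `servers` redundant boxes, any one of which can carry the load.

    Assumes failures are independent, so the service is only down when
    every box is down at the same time: 1 - (1 - a) ** n.
    """
    if servers < 1:
        raise ValueError("need at least one server")
    return 1 - (1 - per_server) ** servers


def downtime_hours_per_year(availability):
    """Expected hours of downtime in a 365-day year."""
    return (1 - availability) * 24 * 365


# Usage: one box that is up 99% of the time versus two of them.
for count in (1, 2):
    total = combined_availability(0.99, count)
    print(count, round(downtime_hours_per_year(total), 2))
```

Under that independence assumption a second 99% server cuts expected downtime from roughly 87.6 hours a year to under one hour. Real numbers will be worse whenever the failures share a cause.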

Software generally fails because it either wasn’t designed to scale, or it was hastily put together to meet a deadline. Startups are infamous for this, as the business guys just want things done quick and cheap and don’t care about reliability until it’s too late (they will also deny this until the end of time). All major software platforms can scale when done correctly. Many people say Perl can’t scale, but it has for a decade: look at IMDB, Amazon, and Slashdot, among many others. Even more claim PHP can’t scale, but Facebook and Yahoo seem to run fine. Python (YouTube), Ruby on Rails (YellowPages.com, Hulu, 43things), and ASP.NET (MySpace and Microsoft) all seem workable in high traffic situations. It’s not what you use, but how you use it. These run on Apache, IIS, Oracle, and MySQL, among others. The platform is rarely (if ever) the problem. The implementation almost always is.

There’s also the possibility that everything is fine and dandy, but somewhere along the internet between the servers and some of your users there’s a problem. ISPs encounter tons of problems, from people snagging their fiber and tearing a line to DoS attacks and viruses wreaking havoc. When this happens close to the user, no sites are accessible. When it happens further away, several sites may be inaccessible or slow. Users often wrongly attribute this to a site or application being slow or down when that’s hardly the case. Using a data center with good connectivity reduces these cases. Having data centers distributed around the globe is even better, but often not economical. The best a business can do is submit a ticket and wait. If it’s frequent enough they can move somewhere else.

Why security fails

Security failure is almost too complex a topic to discuss without holding a complete college course. The most obvious answer is that someone is cleverer than the person in charge of security, and outwitted them. It could be in the physical form (stealing a server or hard drive with data), or in the electronic form (phishing, XSS, DoS). It could be a “hacker”, or it could be an application failure that results in a security glitch.

Many websites take several measures to protect your privacy. They require “strong” passwords, maybe even require you to change them. For things like banking you may have “security questions” to answer. Perhaps even a key fob to provide two-factor authentication.

Most security failures can be traced to stupidity. For example, using “password” for your password, or replying to an email asking for your password. A poorly configured server can also be a vulnerability. Then all you need is someone who wants to exploit that. If the data is of any value, that person exists.

When businesses fail

Hackers want your data. Businesses want to keep it secure, but don’t want to spend too much time or effort on it since the formula is time = money. There’s really not much more to explain here.

Sometimes it’s not even the business you know you’re dealing with. You may be working with company X, but they may use company Y, Z, A1, A2, A3, and A4 to actually provide their services. Your data may be accessible by any or all of them.

Then there’s the possibility of a business going out of business. They may give you a chance to download your data and move it elsewhere, or they may even do it for you. They may also just shut down abruptly and disappear off the face of the earth. Goodbye data.

Take control of your data

I may sound cynical for effectively saying applications fail, many people could potentially see your data, and there’s nothing you can do about it. I’m not. I’m a realist, and I know what goes on behind the scenes. There actually is something you can do about it: Take control of your data. Keep control of your data.

Know who has your data, what they might do to it, who they might share it with, and what they will do to protect it. Companies (at least reputable ones) post privacy policies for a reason. Check them out or Google for some info on that company. The results may surprise you. For example, if you delete something from Google Docs it may take 3 weeks for it to actually be deleted on their servers. This isn’t uncommon, but many people assume once you delete it, the company deletes it. That’s not the case.
Think about accessibility. What happens if that application has an outage? What happens if your ISP has a problem? Or your cable line gets cut? Using an online office suite is a great way to keep documents accessible from home or work, but not so great when you can’t access them. Storing them on a USB drive may prove useful, at least as a backup. If you’ve got a business, this is especially true. You may also want to consider a second way to get online should your ISP have problems (giving a wireless card and a laptop to certain employees may also have the perk of allowing employees to be more mobile).
Decide the fate of your data. I personally prefer to keep a copy of everything so if a company goes under, I still have my data. I host my own blog, and my own photos. I keep backups of all that too, in multiple locations. I know I’ll be around as long as I care about keeping that data online. I’m not going out of business. If I am, I don’t care about that data anymore 😉 . I always have my data. You should too.

Keep control of your data

Just because you’ve figured out how to protect your data, doesn’t mean you’re done. You need to reevaluate yourself every time you start using something else, or change your usage patterns. You don’t have to keep your data offline, just understand what putting it online really means. Offline backups aren’t a bad idea. Having backups on another service is also an option, but may be even more complicated.
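A backup is only worth something if it still matches the original, so it helps to check now and then. Here is a minimal sketch, assuming your backups are plain file copies you can reach from one machine (a USB drive, a second disk, a mounted share). The function names are my own, and this only compares a single file at a time.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large photos or archives don't fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stale_copies(original, copies):
    """Return the backup paths that are missing or no longer match `original`."""
    expected = sha256_of(original)
    bad = []
    for copy in copies:
        copy = Path(copy)
        if not copy.is_file() or sha256_of(copy) != expected:
            bad.append(copy)
    return bad
```

Calling `stale_copies("blog.sql", ["/mnt/usb/blog.sql", "/mnt/offsite/blog.sql"])` (made-up paths) returns an empty list when every copy is intact. Anything it returns is a backup you can’t count on.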

This is somewhat more complicated in the case of things like social networks, but things like Data Portability are slowly becoming a reality.

Google in general has been pretty good about leaving you the option to take your data back. Gmail lets you use IMAP to download all your mail, Google Reader lets you export an OPML feed, and Google Docs lets you save all your docs to your computer. It’s important to know what the services you rely on let you do with your data. Don’t just assume you can easily get it out.
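An export is only useful if you can actually read it somewhere else. As a rough sketch, here is how a feed list exported as OPML could be pulled apart with Python’s standard library. The sample document and its example.com URLs are made up, and I’m assuming the common layout where each subscription is an `outline` element carrying an `xmlUrl` attribute.

```python
import xml.etree.ElementTree as ET

# Made-up sample of what an exported subscription list might look like.
SAMPLE_OPML = """
<opml version="1.0">
  <head><title>My subscriptions</title></head>
  <body>
    <outline text="Tech">
      <outline text="Example Blog" type="rss" xmlUrl="http://example.com/feed.xml"/>
    </outline>
    <outline title="Example News" type="rss" xmlUrl="http://example.com/news.rss"/>
  </body>
</opml>
"""


def feed_urls(opml_text):
    """Return (name, feed URL) pairs for every subscription in an OPML document."""
    root = ET.fromstring(opml_text.strip())
    feeds = []
    # iter() walks nested outlines too, so feeds inside folders are found.
    for outline in root.iter("outline"):
        url = outline.get("xmlUrl")
        if url:  # folder outlines have no xmlUrl and are skipped
            name = outline.get("title") or outline.get("text") or url
            feeds.append((name, url))
    return feeds


for name, url in feed_urls(SAMPLE_OPML):
    print(name, url)
```

Once the list is in a plain form like this, moving it to another reader, or just keeping it in your own backups, stops depending on the original service being around.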

You’re responsible for your fate

It’s easy to blame Google, Microsoft, Yahoo, or Twitter for your problems, but that’s really a poor excuse. You’re responsible for the choices you make, and what you rely on. If what you’re relying on isn’t giving you what you need, you need to find something else, or reevaluate whether you’re putting your priorities in the right place.

I now present to you…

Accettura’s Law Of Business Computing

where
people = prone to frequent failures
technology = expensive, complex, frequent failure

business computing = people + technology = complex frequent failures that are costly in nature.

You can see how this works, right? Best way to avoid that cost? Make sure your technology is redundant, and your people’s interaction is controlled to prevent failure from leaking into the technology.

This should be in Wikipedia and every Business and CompSci textbook. That way everything that a student touches or thinks about in this industry is done with this in mind. Build with the knowledge in mind that the fail whale will just make you a relic before you even hit your prime.

That said, get over Twitter being down and stop complaining.