Another airline, another bad day -- weeks after “performance issues” at Southwest Airlines led to hundreds of flight cancellations or delays, a datacenter fire resulted in a systemwide failure at Delta Air Lines.
Having seen a few airline back ends, I'm not very surprised. Are such outages avoidable? Nearly always. You simply have to create a high-quality system with sufficient resilience and redundancy baked in. It helps to have good processes and programmers, too.
Reliable, redundant infrastructure
Obviously, no matter how great your software and servers are, if you don't have redundant network infrastructure, you may not survive an outage. This starts with your switches and routers, includes equipment such as load balancers, and goes a bit beyond. For instance, you need backup power -- from a reliable provider that routes your power appropriately.
High availability (HA) protects against such problems as a server or service failing. Let's say the runtime your web service is built on does a core dump. No user should know -- they should simply be routed to another instance of the service. To accomplish this, you need to run multiple instances behind load balancers, then replicate state appropriately.
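To make the routing idea concrete, here is a minimal sketch of failover across redundant instances. It is an illustration under stated assumptions, not a production load balancer: each instance is modeled as a plain Python callable, and the names `call_with_failover`, `crashed_instance`, and `healthy_instance` are hypothetical.

```python
from typing import Callable, Sequence


class AllInstancesDown(Exception):
    """Raised when no instance in the pool could serve the request."""


def call_with_failover(instances: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each instance in order and return the first successful response."""
    errors = []
    for instance in instances:
        try:
            return instance(request)
        except Exception as exc:  # a crashed or unreachable instance
            errors.append(exc)
    raise AllInstancesDown(f"{len(errors)} instance(s) failed")


# Hypothetical stand-ins for two copies of the same web service.
def crashed_instance(request: str) -> str:
    raise ConnectionError("instance core dumped")


def healthy_instance(request: str) -> str:
    return f"ok: {request}"


if __name__ == "__main__":
    # The caller never sees the crash; the request lands on the healthy copy.
    print(call_with_failover([crashed_instance, healthy_instance], "GET /flights"))
```

A real deployment would put this logic in a load balancer with health checks rather than in the caller, but the principle is the same: a failure is only invisible if there is somewhere else to send the request.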
Disaster recovery (DR) protects against less likely events, such as the loss of an entire datacenter. Often, DR is implemented with relatively low expectations (sometimes hours or even days before recovery) and/or acceptable data loss. I’d argue that on the modern internet, recovery should take minutes at worst -- same goes for data accessibility. To accomplish DR, you need another datacenter that’s geographically distant and an ironclad failover scheme. You also need to get the data there.
If you use a cloud hosting provider like AWS, opt for servers in multiple regions in case disaster strikes. Some larger organizations go as far as using more than one cloud infrastructure provider.
Rules of replication
Replication comes in multiple forms. There is “now” replication, which you can accomplish between two servers in the same room or (at worst) a few miles away. Then there’s WAN replication. The two are fundamentally different.
We’re used to thinking in terms of throughput. (How big is your pipe? 10 gigabits? 100 gigabits?) But that ignores latency. As distance increases, so does latency. If you have a datacenter in New York City and one in Arizona, your data will take time to travel even if it’s replicated as fast as possible. You won’t use a transactional distributed cache with ACID transactions between the two sites because your data will not get there faster than the speed of light. It’s 2,331 miles from New York City to Arizona, so with light traveling at 186,000 miles per second, you’re talking at least 12.5ms each way, or about 25ms for every round trip a synchronous commit requires.
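As a sanity check on the physics, the following sketch computes the theoretical floor on propagation delay for that distance. The distance and the speed of light come from the text; the two-thirds factor for light in optical fiber is a common rule of thumb added here as an assumption, and the function name `one_way_latency_ms` is hypothetical.

```python
SPEED_OF_LIGHT_MILES_PER_SEC = 186_000
FIBER_FRACTION = 2 / 3  # assumption: light in glass fiber moves at about two-thirds of c


def one_way_latency_ms(distance_miles: float, fraction_of_c: float = 1.0) -> float:
    """Minimum one-way propagation delay in milliseconds, ignoring all equipment."""
    return distance_miles / (SPEED_OF_LIGHT_MILES_PER_SEC * fraction_of_c) * 1000


if __name__ == "__main__":
    distance = 2331  # New York City to Arizona, per the figure in the text
    for label, fraction in (("vacuum", 1.0), ("fiber", FIBER_FRACTION)):
        one_way = one_way_latency_ms(distance, fraction)
        print(f"{label}: {one_way:.1f} ms one way, {2 * one_way:.1f} ms round trip")
```

These numbers are lower bounds only. Measured round trips between sites that far apart are typically several times higher, and a synchronous transaction pays that cost on every commit.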
In reality, the internet operates slower than the speed of light even across fiber, because light moves more slowly through glass than through a vacuum and because switches and packet routing add overhead. Thus, you need a form of replication that doesn’t hold up server performance, but makes a best effort to transport the data with certain expectations of loss in the event that aliens take out your datacenter, "Independence Day"-style.
This involves queues and buffers. Many NoSQL databases such as Couchbase or Cassandra and caches such as Infinispan or GemFire have this feature. Accomplishing this with a traditional RDBMS is a bit more complex and usually requires additional layers (which add latency).
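The queue-and-buffer approach can be sketched roughly as follows. This is a toy illustration, not how Couchbase, Cassandra, Infinispan, or GemFire implement cross-datacenter replication: the local write returns immediately, a background thread ships updates to the remote site, and updates are dropped (and counted) if the buffer fills up. The class `AsyncReplicator` and its parameters are hypothetical.

```python
import queue
import threading


class AsyncReplicator:
    """Best-effort replication: never block the local write on the WAN."""

    def __init__(self, send_to_remote, max_buffered=1000):
        self._send = send_to_remote          # callable(key, value) that ships one update
        self._buffer = queue.Queue(maxsize=max_buffered)
        self.dropped = 0                     # updates lost because the buffer was full
        self.failed = 0                      # updates the remote side rejected
        threading.Thread(target=self._drain, daemon=True).start()

    def write(self, local_store, key, value):
        local_store[key] = value             # commit locally first
        try:
            self._buffer.put_nowait((key, value))
        except queue.Full:
            self.dropped += 1                # accepted loss: the remote site misses this one

    def _drain(self):
        while True:
            key, value = self._buffer.get()
            try:
                self._send(key, value)
            except Exception:
                self.failed += 1
            finally:
                self._buffer.task_done()

    def flush(self):
        """Block until everything currently buffered has been sent."""
        self._buffer.join()


if __name__ == "__main__":
    primary, replica = {}, {}
    replicator = AsyncReplicator(replica.__setitem__)
    replicator.write(primary, "booking:42", "confirmed")
    replicator.flush()
    print(primary == replica)
```

The design choice is the trade-off the text describes: the size of the buffer bounds how much data you are willing to lose if the primary site disappears before the queue drains.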
Fewer, more skilled developers
Applications that don’t blow up also help bolster reliability, so let's talk about software quality. First off, I’ve yet to see a project where scores of inexpensive developers produce good software. The extreme programming people were right: You need small teams with achievable goals. More accurately, you need fewer, more productive developers rather than scores of cheap ones.
Tons of offshore development shops advertise their Capability Maturity Model (CMM) level. But to paraphrase Steve Jobs, don’t confuse content with process -- you can have a perfect process and still produce absolute crap. Imagine if we described the process of painting and brought in some skilled masters. Now imagine if we replaced them with twice as many people who’d taken one art class.
Skilled developers avoid building software vulnerable to such attacks as SQL injection and write more efficient code (which makes it more computationally costly to launch a DoS attack against you). Skilled developers avoid using overhyped technologies for their own sake; they also avoid clinging to old technology that doesn’t do what's required because of “tradition.” Skilled developers look at the trade-offs and make good decisions. Skilled developers make fewer errors.
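As one concrete example of the first point, the standard defense against SQL injection is to pass user input as a bound parameter instead of splicing it into the SQL string. The sketch below uses Python's built-in `sqlite3` module; the `bookings` table and the helper names are hypothetical and exist only for the demonstration.

```python
import sqlite3


def find_bookings(conn, passenger_name):
    """Look up confirmations for one passenger using a bound parameter."""
    # The ? placeholder sends the value separately from the SQL text,
    # so the input is treated as data rather than as part of the statement.
    cursor = conn.execute(
        "SELECT confirmation FROM bookings WHERE passenger = ?",
        (passenger_name,),
    )
    return [row[0] for row in cursor.fetchall()]


def make_demo_db():
    """Build a throwaway in-memory database with two sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE bookings (passenger TEXT, confirmation TEXT)")
    conn.executemany(
        "INSERT INTO bookings VALUES (?, ?)",
        [("alice", "ABC123"), ("bob", "XYZ789")],
    )
    return conn


if __name__ == "__main__":
    conn = make_demo_db()
    print(find_bookings(conn, "alice"))        # a normal lookup
    print(find_bookings(conn, "' OR '1'='1"))  # a classic injection string matches nothing
```

Had the query been built with string concatenation, the second input would have rewritten the WHERE clause and returned every row in the table.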
With fewer, more productive developers, we also avoid excessive communication and coordination, which means fewer misunderstandings.
Quality control and good development practices
It’s 2016 -- use Jenkins, people! Also, write unit tests and end-to-end tests, and perform load tests as part of your process. I know that adds up, but geez, how much has this outage cost Delta?
One big, fat failure justifies it all. Once a seemingly intractable performance problem that “only happens in production” creeps into your code, you’ll bleed money. Unit tests, load tests, and functional tests are all cheaper in the long run. The business justification is that stuff happens -- and when it does, it costs a ton of money. We want to reduce the likelihood of stuff happening.
In summary, brush your teeth, eat your vegetables, deploy decent infrastructure, and use decent development and deployment practices with decent developers. Do these things and life will go well. Fail at any of this and you’ll fail like an airline.