AWS went out Tuesday—and so did half the internet. AWS is more reliable than most people’s datacenters and has all sorts of great features your average corporate IT infrastructure lacks. Yet utter failure visited S3 on the East Coast and lasted for a whopping 11 hours.
The major cloud vendors really, really want you to depend on them. Sure, you can hedge your bets by using multiple cloud regions, but ultimately this is infrastructure you don’t control and in many cases don’t really understand. This is an age-old problem now ported to the cloud.
Disaster recovery must be real time
Traditional disaster recovery (DR) isn’t what we’re looking for in the internet age. Most companies have DR plans with multihour or multiday recovery times. That's not good enough.
Why do enterprises persist with these time frames? The assumption is that outages are caused by “disasters.” Whether it's climate change (which some people pretend doesn’t exist unless their own money is on the line), earthquakes, terrorism, fiber cuts, or something someone fat-fingered somewhere, an extraordinary event has occurred! Meanwhile, whoever depends on your service is sorely disappointed.
During events like these, customers make emotional decisions. You may not have thought about checking out Lyft, but if Uber goes down you might try it. (Or maybe you were horrified by the CEO’s disastrous backseat dancing.) The point is when a service goes wonky, you try its competitor. If you find the competitor more reliable, you may switch.
Avoid single points of failure
Obviously, you want East Coast traffic to go to the East Coast cloud region by default, but if the East Coast cloud region goes down, you need a fallback strategy. That means DNS and other load-balancing infrastructure need redundancies that span regions.
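The failover logic itself is simple in principle. Here's a minimal sketch, with hypothetical region names and a pluggable health check (in practice the check would be an HTTP probe, and the selection would live in your DNS or load-balancing layer, not application code):

```python
def pick_region(regions, is_healthy):
    """Walk regions in priority order and return the first one whose
    health check passes; None means every region failed its check."""
    for region in regions:
        if is_healthy(region):
            return region
    return None

# Simulated outage: us-east is down, so traffic falls back to us-west.
chosen = pick_region(["us-east", "us-west"], lambda r: r != "us-east")
```

The point of the ordered list is the "by default" behavior: East Coast traffic prefers the East Coast region, but the fallback chain spans regions rather than stopping at one.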
Moreover, it means your data has to be where your traffic fails over to. That means WAN replication and the capacity to handle the increased traffic at another zone.
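Asynchronous replication is the usual shape of that WAN link: writes are acknowledged locally and shipped to the remote region in the background, so the remote copy trails by some lag instead of every write paying the cross-country round trip. A toy sketch (the in-memory dicts stand in for real datastores, and `drain` stands in for a background shipper process):

```python
from collections import deque

class AsyncReplicator:
    """Minimal sketch of asynchronous WAN replication: writes apply
    locally at once and queue for the remote region."""
    def __init__(self):
        self.local, self.remote = {}, {}
        self._pending = deque()

    def write(self, key, value):
        self.local[key] = value          # acknowledged immediately
        self._pending.append((key, value))

    def drain(self):
        # In reality a background shipper pushes these over the WAN link.
        while self._pending:
            key, value = self._pending.popleft()
            self.remote[key] = value

    @property
    def lag(self):
        return len(self._pending)        # writes not yet shipped
```

The `lag` property is the number your failover plan has to live with: anything still queued when a zone dies is data the other zone never saw.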
Yet this isn’t enough. While multizone failures are rare, they happen. And if everyone gets smart and redirects all their traffic to the zones still standing, do you really trust any cloud provider to have planned the capacity to handle it? Maybe they can. But given how little they disclose about why incidents happen, do you trust them for real?
Maybe you shouldn't use every whiz-bang proprietary toy AWS puts out. Instead, write truly relocatable microservices that can seamlessly fail over to an entirely different cloud if things aren’t working out: not just to another region, but to Microsoft or Google. That means developing services that run on more than one cloud from the very start. You can’t simply wait for a problem to occur. You need to live multicloud now.
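"Relocatable" mostly means coding against an interface rather than a vendor SDK. Here's a hedged sketch of that idea for blob storage: the class names are illustrative, and real implementations would wrap boto3, google-cloud-storage, or azure-storage-blob instead of the in-memory stand-in used here so the example runs anywhere.

```python
import abc

class BlobStore(abc.ABC):
    """Provider-neutral interface; services code against this,
    never against a specific cloud's SDK."""
    @abc.abstractmethod
    def put(self, key, data): ...
    @abc.abstractmethod
    def get(self, key): ...

class MemoryStore(BlobStore):
    # Stand-in for a real provider client, so the sketch is runnable.
    def __init__(self):
        self._objects = {}
    def put(self, key, data):
        self._objects[key] = data
    def get(self, key):
        return self._objects[key]

class FailoverStore(BlobStore):
    """Try the primary provider, fall back to the secondary on error.
    Real code would catch provider-specific exceptions, not bare OSError."""
    def __init__(self, primary, secondary):
        self._stores = (primary, secondary)
    def put(self, key, data):
        for store in self._stores:
            try:
                return store.put(key, data)
            except OSError:
                continue  # provider down; try the next one
        raise RuntimeError("all providers failed")
    def get(self, key):
        for store in self._stores:
            try:
                return store.get(key)
            except OSError:
                continue
        raise KeyError(key)
```

Swapping clouds then becomes a wiring change at startup rather than a rewrite, which is the whole point of building this way from the start.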
All of this means you need to develop a system that tolerates latency. Your fake cloudification, where you stuck SQL Server in the cloud and plonked a load balancer in front, isn’t likely to cut it without real thought and reengineering.
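Tolerating latency concretely means bounding every remote call and retrying with backoff instead of assuming the database is a LAN hop away. A minimal sketch (the parameters are illustrative defaults, not recommendations):

```python
import random
import time

def call_with_retries(fn, attempts=4, base_delay=0.1, max_delay=2.0):
    """Retry a flaky remote call with exponential backoff and jitter,
    so cross-region latency spikes degrade gracefully instead of
    cascading into hard failures."""
    for attempt in range(attempts):
        try:
            return fn()
        except OSError:                  # treat network errors as retryable
            if attempt == attempts - 1:
                raise                    # out of attempts; surface the error
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay * random.uniform(0.5, 1.0))  # jittered backoff
```

The jitter matters: if every client retries on the same schedule after a regional blip, the synchronized retry wave is its own outage.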
Nothing is new, except when it is
The great thing about all this is that none of it is new. It's simply cheaper and easier to do it now.
In the past, companies set up multiple datacenters, WAN replication, redundant DNS, and clustered services. All of this has long been doable, just less affordable. Today’s database technology, microservices architecture, and vastly improved software make redundancy much cheaper and easier. Just don't get hung up on a single provider's offerings.
Gee, that sounds familiar.