8 dark secrets of cloud hardware

Where is your code running, what is it running on, and why did it stop? You may never know


Long ago, a server was something that was yours and yours alone. You and your team would scrutinize the specs, collect bids, fill out a purchase order, and then take delivery of the machine so it could be carefully installed and tested in the server room just down the hall from your desk. You and your team could walk over, touch it, check that the LED was burning bright, and feel secure listening to the quiet hum of the fan. You might even polish the front panel with a shirt sleeve.

Now you might not have anything to do with your hardware. Some people still click through a cloud company’s webpage to create an “instance,” but many of us leave all the work of starting up a server to an automatic script run by some continuous integration and deployment bot. At most, we spend a few moments debating the size of the instance when we configure the build routine, but after that the work is left to our robot deployment routines. The software may even be clever enough to negotiate the auctions for spare cycles to minimize costs, all without us doing a thing.

Our disconnect from the hardware grows even deeper as the “serverless” buzzword spreads. Of course, the companies don’t literally mean there’s no server in the loop. They just mean that you shouldn’t worry your little head about anything to do with those boxes of chips whirring away somewhere else. Just give us your few lines of code and we’ll make sure that some piece of silicon in our back warehouse will run it.

Many of these mysteries are labor-saving and stress-saving innovations. Being left in the dark means not wasting our time on the details of memory configuration, drive partitioning, or whether that broken DVD-ROM tray matters. Skipping these thoughts is a good thing. Developers have worked long and hard to build agile tools and bots so we can skip the staff meetings that discuss and review these annoying issues.

But sometimes a bit too much is being swept under the rug. Sometimes a bit too many details are elided from the discussion before we click that button and agree to the bazillion terms of that seemingly endless contract that no one ever reads.

The good news is that many times these details won’t matter. We’ve stopped worrying about them because we’ve crossed our fingers and it all worked out in the past. It was a good gamble to ignore them before and so we roll the dice again.

But sometimes the mysteries are worth considering in case our code happens to be the one time when it matters. The one time in one hundred, one thousand, or one bazillion when we should have asked a few more questions.

We’re not saying that you should be paranoid. We’re not saying you should be staying up late worrying about these eight things. But if you do find yourself unable to sleep, here are eight mysteries of modern hardware to ponder when you have nothing better to do.

Where is the server?

It’s in the cloud. That may be all we know. The companies may say our instance is running in New York or Karachi but that’s all we get. Often the best we can do is know the name of the city or maybe just the country.

Should we care about the street address? Maybe the murky location of the building itself is a security feature, not a bug. If we don’t know the physical location of the box, well, the bad guys will be just as confused. It’s not like we will ever touch the box or listen to the hum like we did when we took the suits on a tour of the server room.

The thing is, some of us actually need to fret about the physical location of the data center. We worry about tax laws or legal issues with jurisdiction. Some of us need to worry about export laws or letting our data cross a border. Some of us have lawyers calling us with questions like this. Some of us have to deal with subpoenas.
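In practice, the finest granularity most clouds expose is a region or availability-zone name. A minimal sketch of what that actually tells you, assuming AWS-style zone names (a region plus a trailing letter); the zone string itself would come from the cloud’s instance metadata service, which only answers from inside a running instance:

```python
def az_to_region(availability_zone: str) -> str:
    """Strip the trailing zone letter from an AWS-style
    availability-zone name, e.g. 'us-east-1a' -> 'us-east-1'.
    A metro-area name like this is often all we will ever
    learn about where the hardware physically sits."""
    return availability_zone.rstrip("abcdefghijklmnopqrstuvwxyz")

# The zone would normally be fetched from the instance metadata
# endpoint (on AWS, an HTTP service at 169.254.169.254); here we
# just use a sample value.
print(az_to_region("us-east-1a"))
```

For the lawyer’s questions about jurisdiction, that region string is usually the end of the trail: a label naming a metro area, not a street address.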

What is the CPU?

Remember pondering whether you wanted a sixth-generation chip or if you could justify splurging on a seventh-generation hot rod? Remember looking at the rows and rows of benchmark numbers and dividing the cost by the speed? Remember how much fun it was to brag about upgrading to the fourth-generation CPUs when you’re out to lunch with Chris who was forced by the bean counters to squeeze another year out of the third-generation chips?

Now the odds are good that you won’t know the manufacturer or the model number or any detail whatsoever about the CPU. The cloud companies sell you instances with cryptic names like “m1” or “large,” but that doesn’t mean much. The “m1” and the “m2” may not have anything to do with each other. They’re just names.

Some cloud companies try to measure the “virtual” CPU power you’re buying and then let you dial up just the right amount. It might have something to do with the number of cores on the machine—something that affects your threading and parallel algorithms—or it might not. It all might be a facade that measures only the amount you’re purchasing.
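You can at least ask the instance what it thinks it is running on, though the answer may be the hypervisor’s fiction. A small probe, assuming a Linux guest with /proc/cpuinfo (on other systems only the core count survives):

```python
import os
import platform

def cpu_report() -> dict:
    """Collect whatever the (possibly virtualized) OS will admit
    about the CPU. On a cloud instance the model name may be
    masked, out of date, or a hypervisor placeholder."""
    info = {
        "logical_cores": os.cpu_count(),  # what your threads can use
        "machine": platform.machine(),    # e.g. 'x86_64' or 'aarch64'
        "model": "unknown",
    }
    try:
        with open("/proc/cpuinfo") as f:  # Linux only
            for line in f:
                if line.startswith("model name"):
                    info["model"] = line.split(":", 1)[1].strip()
                    break
    except OSError:
        pass  # not Linux, or /proc is hidden from us
    return info

print(cpu_report())
```

Even when a model name comes back, it tells you nothing about how many tenants share the die or how the “virtual CPU” units map onto it.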

Sometimes the hardware makes a difference. Sometimes there are security holes or glitches that can be traced to particular chips. The “Hidden God Mode” vulnerability affected the VIA C3 set of x86 chips. Sometimes we need to know about threading models and cores just to make our algorithms run faster. There are dozens of little and not-so-little problems like this. We can cross our fingers because the cloud companies should stay on top of this for us. Or so they say.

What kind of memory?

Long ago we thought about whether it was worth it to install faster memory with more error correction circuitry. Long ago we wondered whether some RAM was better or more stable than others. Long ago we chose certain RAM manufacturers over others and had opinions about brand names and technological approaches.

Now we’ll never know just how good the hardware might be. This is one thing the cloud company engineers are supposed to worry about so we don’t have to. But did they? We’ll never really know. Maybe our instances are crashing because of bad RAM. Maybe it’s because of our own terrible code. We’ll never know.
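Short of trusting the provider, about all a tenant can do is a crude userspace sanity check: write known bit patterns through a buffer and read them back. This is only a sketch of the idea; it exercises a sliver of address space and says nothing about ECC or the faults a real memory tester would catch:

```python
def pattern_check(size_bytes: int = 1 << 18) -> bool:
    """Write alternating bit patterns through a buffer and verify
    the readback. A failure here would point at the RAM -- or, far
    more likely, at our own terrible code."""
    buf = bytearray(size_bytes)
    for pattern in (0x55, 0xAA, 0x00, 0xFF):
        for i in range(size_bytes):
            buf[i] = pattern
        if any(b != pattern for b in buf):
            return False
    return True

print(pattern_check())
```

A clean pass proves very little; it mostly illustrates how little visibility we have into the memory we’re renting.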

What kind of drives?

Some cloud companies will brag about using SSDs. Some will brag about using faster hard disks. Some will just rent us 25 gigabytes of storage and not get into the details. But not all disk drives have the same reliability ratings. Not all flash memory is created equal. Did our code fail because of some sticky flash cell that’s been overwritten too many times? Or was it that new programmer who desperately wanted to push the new code? It’s not for us to worry about anymore. We just boot up another instance and move on.
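One thing a tenant can measure is how the rented storage behaves right now. A rough probe that times small synchronous writes; assume the number reflects this instant on this instance, not the drive’s wear, reliability, or even its technology:

```python
import os
import tempfile
import time

def fsync_latency_ms(writes: int = 50) -> float:
    """Average time for a 4 KiB write plus fsync, in milliseconds.
    SSDs, spinning disks, and network-backed volumes give very
    different answers -- sometimes the only clue to what we rented."""
    block = b"x" * 4096
    with tempfile.NamedTemporaryFile() as f:
        start = time.perf_counter()
        for _ in range(writes):
            f.write(block)
            f.flush()
            os.fsync(f.fileno())  # force the write through the cache
        elapsed = time.perf_counter() - start
    return elapsed * 1000 / writes

print(f"{fsync_latency_ms():.2f} ms per synced 4 KiB write")
```

Millisecond-scale results hint at spinning or network-backed storage; sub-millisecond results hint at flash, though a write cache somewhere in the stack can fake either answer.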

Even transistors aren’t simple

RAM is perhaps the simplest part of the whole machine, and it comes with semantics that are basic and boring. Bits go in paired with an address, and the same bits come out when that address is presented again.

Alas, the transistors may seem to be digital devices that store only two values, but that’s only in the theory section of the textbook. In real life, they are inherently analog circuits, and this can lead to some scary leaks. Researchers are discovering clever techniques like Rowhammer and RAMBleed, and ingenious hackers are figuring out how to exploit them remotely. If we can’t trust the basic semantics of RAM, what can we trust?

Other chips are even more mysterious 

Most people spend even less time thinking about the rest of the computer. We talk about the CPU and sometimes the GPU, but does anyone outside the networking team discuss the NPU, the network processing unit? It sits there quietly moving the data with such devotion and unflappability that everyone forgets it exists. But the NPUs have firmware of their own, and the clouds have elaborate, reconfigurable networking layers with some of the most sophisticated semantics around. While we’re fussing over abusing branch prediction and Rowhammer, has anyone spent much time thinking about what a hacker can do with a network card?

What kind of technology?

Sometimes we don’t even know the right buzzword to use to describe a service. Amazon’s Glacier storage is one of the cheaper places to park your bits, but Amazon won’t explain what technology they’re using. Is it built from racks and racks of slow magnetic hard disks? Or perhaps they burn the data onto stacks and stacks of Blu-ray disks? Or maybe they use magnetic tape loaded by robot arms? Maybe they’ve used two or three different technologies so they could switch the cost curves around? It’s all a mystery. All we know is how much it costs per gigabyte and how long it might take to retrieve the information.
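Since cost per gigabyte and retrieval delay are all we get, reasoning has to start from the price sheet. A sketch of the back-of-the-envelope comparison that replaces knowing the technology; the prices below are placeholders, not real quotes:

```python
def monthly_storage_cost(gigabytes: float,
                         price_per_gb: float,
                         retrievals_per_month: float = 0.0,
                         retrieval_price_per_gb: float = 0.0) -> float:
    """Monthly bill for parking data in an opaque storage tier.
    We can model the cost curve without ever learning whether
    the bits live on disk, tape, or Blu-ray."""
    return gigabytes * (price_per_gb
                        + retrievals_per_month * retrieval_price_per_gb)

# Placeholder prices (NOT real quotes): a 'hot' tier versus a
# 'cold' tier that adds a per-gigabyte retrieval surcharge.
hot = monthly_storage_cost(1000, price_per_gb=0.023)
cold = monthly_storage_cost(1000, price_per_gb=0.004,
                            retrievals_per_month=1,
                            retrieval_price_per_gb=0.01)
print(f"hot: ${hot:.2f}/mo, cold: ${cold:.2f}/mo")
```

The cold tier only wins if the data is rarely read back, which is exactly the trade the opaque pricing is nudging us toward.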

What’s going on?

Sometimes we’ll never find out what’s going on at all. Moving to the cloud doesn’t remove the dangers of bad events like power outages, imploding disk drives, or ransomware, but it does cut us off from learning what’s going on. In our server rooms, everyone is on our team and everyone reports to the same boss. They might not always tell us the truth, but they’ll generally be more forthcoming.

In the cloud, we probably won’t know anyone who is handling the problem. At best, we’ll communicate through emails and trouble tickets. Even then, the lawyers, the managers, and the PR flacks get in the way and the only thing we get is carefully worded CYA. At best, we’ll learn “mistakes were made.” At worst, we’ll hear nothing.

A good example of this confusion is a recent ransomware attack on QuickBooks accounting data hosted in the cloud. Customers who believed the marketing rhetoric about the carefree life of letting the cloud handle the data were left to wonder what really went on. The same kind of attack could easily have taken down our own data center, but at least we would know the names of the people involved and we might see them at the company picnic.

Copyright © 2019 IDG Communications, Inc.