Whenever we talk about distributed denial-of-service (DDoS) attacks, we typically focus on internet of things botnets and shadowy attackers sending large volumes of junk traffic against networks and applications. For many organizations, however, self-inflicted DDoS attacks pose bigger dangers, a pair of Google engineers warned.
Poor software architecture decisions are the most common cause of application downtime, Google’s director of customer reliability engineering Dave Rensin and Google site reliability engineer Adrian Hilton wrote on the Google Cloud Platform blog. The mistakes aren’t in the code, but in the assumptions the developers make about applications and user interactions, especially as they relate to system load.
“We think it’s a convenient moment to remind our readers that the biggest threat to your application isn’t from some shadowy third party, but from your own code!” Rensin and Hilton wrote.
Software developers typically assume that load will be evenly distributed across all the users. That isn’t a bad assumption, but if the developer doesn’t have a contingency plan to deal with situations when load is not evenly distributed, “things start to go off the rails,” the engineers said.
For example, a mobile app developer writing code to fetch data from a back-end server every 15 minutes may not think twice about also putting in logic to retry every 60 seconds if the app encounters an error. On the surface, this looks reasonable and is actually a common pattern. What many developers fail to consider, or plan for, is what happens if the back-end server is unavailable for a minute.
During that minute, some apps will try to fetch the data and encounter an error. When the back-end server comes back online, it will be hit with the requests normally expected for that minute as well as the traffic from apps retrying after the one-minute delay. That’s double the expected traffic, and load is no longer evenly distributed because “two-fifteenths of your users are now locked together into the same sync schedule,” the engineers said.
For a given 15-minute period, the server will experience normal load for 13 minutes, no load for one minute, and double load for one minute.
Service disruptions usually last longer than a minute. In order to correctly handle a 15-minute service disruption, software engineers would need to provision at least 15 times normal capacity to keep the application server from falling over when coming back online. If the back-end server responds more slowly to each request because of the growing load on the load balancer, the total number of requests can exceed 20 times normal traffic during this time.
“The increased load might cause your servers to run out of memory or other resources and crash again,” Hilton and Rensin wrote.
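The arithmetic in both scenarios can be checked with a small simulation. The sketch below is illustrative only and is not from the Google post: the function name `simulate`, the figure of 100 clients per minute, and the outage starting at minute 5 are all assumptions, and the model counts only requests that the server actually serves.

```python
SYNC_INTERVAL = 15  # minutes between regular syncs
RETRY_INTERVAL = 1  # fixed retry delay, in minutes


def simulate(clients_per_minute, outage_minutes, total_minutes):
    """Return the number of requests served in each minute."""
    # Spread clients evenly: each client starts at a phase from 0 to 14.
    next_attempt = [
        phase
        for phase in range(SYNC_INTERVAL)
        for _ in range(clients_per_minute)
    ]
    served = [0] * total_minutes
    for minute in range(total_minutes):
        for i, due in enumerate(next_attempt):
            if due != minute:
                continue
            if minute in outage_minutes:
                # Server is down: try again after the fixed delay.
                next_attempt[i] = minute + RETRY_INTERVAL
            else:
                served[minute] += 1
                next_attempt[i] = minute + SYNC_INTERVAL
    return served


# One-minute outage at minute 5 (hypothetical timing).
blip = simulate(100, {5}, 30)
# Fifteen-minute outage covering minutes 5 through 19.
outage = simulate(100, set(range(5, 20)), 30)

print(blip[4:8])   # load around the one-minute blip
print(outage[20])  # load in the first minute after the long outage
```

Under these assumptions, the one-minute blip produces a minute with no served requests followed by a minute with 200, and the doubled minute recurs 15 minutes later because those clients now share a schedule. After the 15-minute outage, all 1,500 simulated clients arrive in the same minute, which is 15 times the normal 100.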
Tricks to avoid self-DDoS attacks
Application developers can adopt three methods to avoid the self-inflicted DDoS attack: exponential back-off, adding jitter, and marking retry requests. Incorporating these methods can “keep your one minute network blip from turning into a much longer DDoS disaster,” the engineers wrote.
Exponential back-off refers to adding a delay that doubles with every failed retry attempt, to build in a longer time interval between failed connection attempts. Using a fixed retry interval almost guarantees that retry requests will stack at the load balancer. In the case of the above example, after the first one-minute retry, the app will wait two minutes, four minutes, eight minutes, and 16 minutes before going back to one minute. Exponential back-off lowers the number of overall requests queued from 15 to five.
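The doubling schedule can be sketched as a short generator. This is a minimal illustration rather than the engineers' code: the name `backoff_delays` and its parameters are hypothetical, and the wrap back to the base delay follows the one-to-16-minute cycle described above.

```python
import itertools


def backoff_delays(base_minutes=1, max_doublings=4):
    """Yield retry delays that double after each failure, then start over."""
    while True:
        for n in range(max_doublings + 1):
            yield base_minutes * (2 ** n)


# First six delays, in minutes: the cycle restarts after the 16-minute wait.
delays = list(itertools.islice(backoff_delays(), 6))
print(delays)  # [1, 2, 4, 8, 16, 1]
```

A client would sleep for the next value from the generator after each failed attempt and create a fresh generator after a successful sync.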
Jitter, the act of adding or subtracting a random time period to the next retry interval to vary the timing of the next attempt, helps prevent apps from getting locked into the same sync cycle. In the example, if the next back-off interval is set to four minutes, adding jitter could result in the app sending a new retry attempt at some point between 2.8 minutes and 5.2 minutes instead of waiting a full four minutes.
“The usual pattern is to pick a random number between +/- a fixed percentage, say 30 percent, and add it to the next retry interval,” the engineers said.
In the real world, usage is almost never evenly distributed. Nearly all systems experience peaks and troughs corresponding with their users’ work and sleep patterns, Rensin and Hilton said. Users may turn off their devices at night, which will result in a traffic spike when they wake back up.
“For this reason it’s also a really good idea to add a little jitter (perhaps 10 percent) to regular sync intervals, in addition to your retries,” the engineers said, noting that this would have the most impact for first syncs after restarting the application.
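Jitter for both retries and regular syncs comes down to one small helper. The sketch below assumes a uniform random offset, which the engineers do not specify, and the name `with_jitter` is hypothetical.

```python
import random


def with_jitter(interval_minutes, fraction, rng=random):
    """Shift an interval by a random amount within +/- fraction of itself."""
    return interval_minutes * (1 + rng.uniform(-fraction, fraction))


next_retry = with_jitter(4, 0.30)  # somewhere between 2.8 and 5.2 minutes
next_sync = with_jitter(15, 0.10)  # somewhere between 13.5 and 16.5 minutes
```

Because each client draws its own offset, clients that failed in the same minute are unlikely to retry in the same minute again.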
Finally, each attempt should be marked with a retry counter so that the application can prioritize client connections. In the case of a service disruption, servers don’t all come back online at the same time and the application’s overall capacity is limited at first. The application logic may focus on responding to requests with a higher retry number because those clients have been waiting the longest, for example.
“A value of zero means that the request is a regular sync. A value of one indicates the first retry and so on,” Rensin and Hilton said.
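One way a server might use the counter while capacity is limited is sketched below. The `SyncRequest` type, its field names, and the `select_requests` helper are hypothetical, and serving the highest counts first is only the example policy mentioned above, not a prescribed one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncRequest:
    client_id: str
    retry_count: int  # 0 = regular sync, 1 = first retry, and so on


def select_requests(pending, capacity):
    """Pick up to `capacity` requests, highest retry counts first."""
    # sorted() is stable, so requests with equal counts keep arrival order.
    ranked = sorted(pending, key=lambda r: r.retry_count, reverse=True)
    return ranked[:capacity]


pending = [
    SyncRequest("a", 0),
    SyncRequest("b", 3),
    SyncRequest("c", 1),
    SyncRequest("d", 3),
]
chosen = select_requests(pending, capacity=2)
print([r.client_id for r in chosen])  # ['b', 'd']
```

Requests that are not selected would be rejected or deferred, and the clients would come back with a higher counter on their next attempt.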
System outages and service disruptions happen for a myriad of reasons. In a given 30-day month, a system consistently maintaining 99.9 percent availability can be potentially unavailable for up to 43.2 minutes, so a 15-minute disruption is completely within the realm of possibility. How the application recovers from these incidents determines how well it handles unexpected traffic volumes.
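The 43.2-minute figure follows directly from the availability target, as this short calculation shows. The function name `max_downtime_minutes` is hypothetical.

```python
def max_downtime_minutes(availability, days=30):
    """Return the downtime budget, in minutes, for an availability target."""
    total_minutes = days * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (1 - availability)


budget = max_downtime_minutes(0.999)
print(round(budget, 1))  # 43.2
```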