Roblox’s cloud-native catastrophe: A post mortem

How Roblox chased down and fixed the flaws in its HashiCorp-powered distributed infrastructure that caused a three-day worldwide outage.

In late October 2021, Roblox’s global online game network went down in an outage that lasted three days. The site is used by 50 million gamers daily. Figuring out and fixing the root causes of this disruption would take a massive effort by engineers at both Roblox and their main technology supplier, HashiCorp.

Roblox eventually provided an amazing analysis in a blog post at the end of January. As it turned out, Roblox was bitten by a strange coincidence of several events. The processes Roblox and HashiCorp went through to diagnose and ultimately fix things are instructive to any company running a large-scale infrastructure-as-code installation or making heavy use of containers and microservices across their infrastructure.

There are a number of lessons to be learned from the Roblox outage.

Roblox went all in on the HashiCorp software stack.

Roblox’s massively multiplayer online games are distributed across the world to provide the lowest possible network latency, ensuring a fair playing field among players who might be connecting from far-flung places. Hence Roblox uses HashiCorp’s Consul, Nomad, and Vault to manage a collection of more than 18,000 servers and 170,000 containers distributed around the globe. The Hashi software is used to discover and schedule workloads and to store and rotate encryption keys.

Rob Cameron, Roblox’s technical director of infrastructure, gave a presentation at the 2020 HashiCorp user conference about how the company is using these technologies and why they are essential to the company’s business model (the link takes you to both a transcript and a video recording). Cameron said, “If you’re in the United States and you want to play with somebody in France, go ahead. We’ll figure that out and give you the best possible gaming experience by placing the compute servers as close to the players as possible.”

Roblox’s engineering team initially followed a series of false leads.

In tracking down the cause of the outage, the engineers first noticed a performance issue and assumed a bad hardware cluster, which was replaced with new hardware. When performance continued to suffer, they came up with a second theory about heavy traffic, and the entire Consul cluster was upgraded with twice the CPU cores (going from 64 cores to 128) and faster SSD storage. Other attempts were made, including restoring from a previous healthy snapshot, reverting to 64-core servers, and making other configuration changes. These were also unsuccessful.

Lesson #1: Although hardware issues are not uncommon at the scale Roblox operates, sometimes the initial intuition to blame a hardware problem can be wrong. As we’ll see, the outage was due to a combination of software errors.

Roblox and HashiCorp engineers eventually found two root causes.

The first was a bug in BoltDB, an open source database used within Consul to store certain log data, that didn’t properly clean up its disk usage. The second was excessive write contention in a new Consul streaming feature, recently rolled out by Roblox, that came under unusually high load.

Lesson #2: Everything old is new again. What was interesting about these causes is that they had to do with the same kinds of low-level resource management issues that have haunted systems designers since the earliest days of computing. BoltDB failed to release disk storage as old log data was deleted. Consul streaming suffered write contention under very high loads. Getting to the root cause of these problems required deep knowledge of how BoltDB tracks free pages within its data file and how Consul streaming makes use of Go concurrency.
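To see how a freelist can mask unreclaimed disk space, here is a toy Go sketch — not BoltDB’s actual code, just the general pattern — of a page allocator that reuses freed pages but never shrinks the underlying file:

```go
package main

import "fmt"

// pager is a toy model of a freelist-based page store: freed pages are
// kept for reuse, but the file's high-water mark never goes back down.
type pager struct {
	filePages int   // total pages ever allocated; file size never decreases
	free      []int // page IDs released by deletes, held for reuse
}

// alloc returns a reusable freed page if one exists, otherwise grows the file.
func (p *pager) alloc() int {
	if n := len(p.free); n > 0 {
		id := p.free[n-1]
		p.free = p.free[:n-1]
		return id
	}
	p.filePages++ // growth is permanent
	return p.filePages - 1
}

// release marks a page as free for reuse; it does not shrink the file.
func (p *pager) release(id int) { p.free = append(p.free, id) }

func main() {
	p := &pager{}
	ids := make([]int, 0, 1000)
	for i := 0; i < 1000; i++ {
		ids = append(ids, p.alloc()) // a burst of log writes grows the file
	}
	for _, id := range ids {
		p.release(id) // deleting old log data only populates the freelist
	}
	fmt.Println("file pages:", p.filePages, "free pages:", len(p.free))
	// Output: file pages: 1000 free pages: 1000
}
```

The point of the sketch: after every log entry is deleted, the file is still 1,000 pages long. Under steady churn that is fine, but a workload spike can ratchet disk usage up permanently — the flavor of problem Roblox hit.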

Scaling up means something completely different today.

When running thousands of servers and containers, manual management and monitoring processes aren’t really possible. Monitoring the health of such a complex, large-scale network requires deciphering dashboards such as the one below:

[Screenshot: Roblox’s Consul metrics dashboard during normal operation]

Lesson #3: Any large-scale service provider must develop automation and orchestration routines that can quickly zero in on failures or abnormal values before they take down the entire network. For Roblox, variations of mere milliseconds of latency matter, which is why they use the HashiCorp software stack. But how services are segmented is critical too. Roblox ran all of its back-end services on a single Consul cluster, and this ended up being a single point of failure for its infrastructure. Roblox has since added a second location and begun to create multiple availability zones for further redundancy of its Consul cluster. 
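The kind of automated check Lesson #3 describes can start as simply as comparing each new metric sample against a rolling baseline. The window, threshold factor, and metric values below are illustrative assumptions, not Roblox’s actual alerting rules:

```go
package main

import "fmt"

// anomalous flags a sample that exceeds the mean of a recent window by
// more than the given factor. A hypothetical sketch of threshold-based
// alerting, not any particular monitoring product's logic.
func anomalous(window []float64, sample, factor float64) bool {
	if len(window) == 0 {
		return false // no baseline yet; don't alert
	}
	var sum float64
	for _, v := range window {
		sum += v
	}
	mean := sum / float64(len(window))
	return sample > mean*factor
}

func main() {
	// Recent latencies in milliseconds (illustrative values).
	baseline := []float64{1.1, 0.9, 1.0, 1.2, 1.0}
	fmt.Println(anomalous(baseline, 1.3, 3))  // false: within normal range
	fmt.Println(anomalous(baseline, 30.0, 3)) // true: page the on-call
}
```

Real systems layer on percentiles, rate-of-change checks, and alert deduplication, but the principle is the same: the machine, not a human, watches every metric and decides when milliseconds of drift matter.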

One of the reasons Roblox uses the HashiStack is to control costs.

“We build and manage our own foundational infrastructure on-prem because at the scale that we know we’ll reach as our platform grows, we have been able to significantly control costs compared to using the public cloud and manage our network latency,” Roblox wrote in their blog post. The “HashiStack” is an efficient way to manage a global network of services, and it allows Roblox to move quickly—they can build multi-node sites in a couple of days. “With HashiStack, we have a repeatable design pattern to run our workloads no matter where we go,” said Cameron during his 2020 presentation. However, too much depended on a single Consul cluster—not only the entire Roblox infrastructure, but also the monitoring and telemetry needed to understand the state of that infrastructure.

Lesson #4: Network debugging skills reign supreme. If you don’t know what is going on across your network infrastructure, you are toast. But debugging thousands of microservices isn’t just checking router logs; it requires taking a deep dive into how the various bits fit together. This was made especially challenging for Roblox because they built their entire infrastructure on their own custom server hardware, and because there was a circular dependency between Roblox’s monitoring systems and Consul. In the aftermath, Roblox has removed this dependency and extended their telemetry to provide better visibility into Consul and BoltDB performance, and into the traffic patterns between Roblox services and Consul.

Be transparent about your outages with your customers.

This means more than just saying “We were down, now we are back online.” The details are important to communicate. Yes, it took Roblox more than two months to get their story out. But the document they produced, drilling down into the problems, showing their false starts, and describing how the engineering teams at Roblox and HashiCorp worked together to resolve the issues, is pure gold. It inspires trust in Roblox, HashiCorp, and their engineering teams.

When I emailed HashiCorp public relations, they responded, “Because of the critical role our software plays in customer environments, we actively partner with our customers to provide our recommended best practices and proactive guidance in architecting their environments.” Hopefully your critical infrastructure provider will be as willing when your next outage occurs.

Clearly, Roblox was pushing the envelope on what the HashiStack could provide, but the good news is that they figured out the problems and eventually got them fixed. A three-day outage isn’t a great outcome, but given the size and complexity of the Roblox infrastructure, it was an awesome accomplishment nonetheless. And there are lessons to be learned even for less complex environments, where some software library may still be hiding a low-level bug that will suddenly reveal itself in the future.

Copyright © 2022 IDG Communications, Inc.