How to serve billions of Web requests per day – without breaking a sweat

A new technology stack for modern, real-time, data-driven applications also leaves plenty of room to grow

Today, the AppLovin mobile advertising platform handles some 20 billion ad requests each day — at up to 500,000 transactions per second — as it helps brands acquire new customers and re-engage existing ones. How does AppLovin create an infrastructure that can handle billions of requests and that scales without significant increases in hardware and people?

This article shares the company's discoveries and the best practices it has adopted to select and evolve the technology stack to scale the business.

Cost-effective scaling

Let’s start with scale. For us at AppLovin, cost-effective scaling means being able to handle increases in load without requiring a linear increase in hardware or people. It wouldn't really help if we had to double the amount of requests we handle but needed twice the number of servers and twice the number of people. As we have made our infrastructure more efficient, we have actually reduced the number of servers needed to handle 20 times the impressions over the last year.

AppLovin scale AppLovin

It is also important to realize that, at large scale in a distributed system, there is no off-the-shelf solution; this type of system needs to be built from custom components. No matter what the industry, the technology team will have to build a distributed system out of carefully selected components. Moreover, every time there is a major increase in load, those components will change as well.

The need to plan for change means the infrastructure has to be flexible. We realized from the start that the mobile ad world was going to change quickly, and we needed a flexible infrastructure that could adapt. We wanted to build pieces that would allow us to innovate no matter what the market needed. For example, if we needed to do retargeting, we wanted to build that on top of an infrastructure, without having to reinvent the whole infrastructure. That’s exactly what we did.

The approach has paid off. We recently doubled our traffic in the course of a month. Having a flexible infrastructure enabled us to do that.

Adaptable, scalable real-time infrastructure

With those architectural requirements in mind, we have built an infrastructure stack that includes Web servers, a real-time caching layer, databases, distributed messaging services, and massively parallel compute systems.

At the front end are hundreds of Web servers. These servers are answering billions of requests per day. As each request comes in, we have to make a decision whether we want to bid on this impression, how much to pay for it, and which ad to serve — and we make that decision in about 50 milliseconds.

Next we need to cache the user profile information for the billions of users who have mobile phones. This information needs to be available in a very small period of time in order for those Web servers to respond and decide whether to bid on a given ad request. In short, what’s needed is a distributed caching layer to immediately serve out data for all incoming requests. Systems such as Aerospike, Redis, and Memcached are used for the caching layer.

Beyond that, a rich set of analytics, reporting, data warehousing, and data science functions need to be able to access different types of databases. At large scale, these functions must be distributed. To enable that distribution we use distributed messaging or publish/subscribe messaging services. Distributed messaging gives us key advantages:

  • We can take information from anywhere that it comes in.
  • We can use log files as a transactional unit to deal with hundreds of thousands of requests per second.
  • We can have any service subscribe to the information it needs.

The messages have to be distributed out to anywhere in the world. It could be to an HP Vertica data warehouse, a MySQL database, an Apache Hadoop system, or an Apache Storm real-time processing system. Distributed messaging is a key piece of any real-time architecture.

Finally, we need distributed computation to process data. Having distributed computation means using the likes of Hadoop or Apache Spark — a parallel processing system that can see all of the data and scale to handle huge data loads.

All of the components listed above are connected through a distributed log-based architecture. The fundamental idea behind the log-based architecture is that it has a number of sources, and it has log files as the transactional unit, coming out of all the sources. For example, an ad server might log, "Did I serve an ad? Did the user click? Did I see a transaction?” It writes out that log, chunks of logs are dumped into the message system, they go somewhere, they get processed, and the data gets written into databases. All of the data is available in the logging system for subscription by any service that needs the data.

The reason this type of architecture enables innovation is that you can plug any component in to that system. You might need to plug in an Aerospike real-time database in certain places; you might want to plug Vertica in others. You must be able to get all of that incoming information out to any of those different tools. Having a log-based architecture enables us to hook up all of our data sources through logs into a centralized logging system for real-time subscription.

Evaluating technology options

The evolution of our platform is a good example of why it is important to have a flexible infrastructure.

We actually started building with PHP. It was very fast to do so, and it was easy to find developers who knew how to work with it. The same was true with MySQL, and at the same time MongoDB was one of the top NoSQL databases, so we used that as well. Of course, as a startup, we initially built most of our platform on Amazon Web Services. Finally, we used RabbitMQ for our publish/subscribe messaging.

Over time, we’ve migrated our data to a combination of Aerospike, Redis, Apache Cassandra, Vertica, and Hadoop systems. We’ve transitioned from PHP to C++, and we have moved from RabbitMQ to a custom Java-based system for our messaging. At the same time, we’ve pared down our use of other systems to a relative few that we really understand and the engineering team knows how to deal with.

Bringing in new software is an expensive proposition. Whether it's open source or licensed software, if you try something and it doesn't work, it sets you back for months. So we carefully evaluate whether a product is worth it.

One of the first actions we take is to look at who is using a product already and what kind of proven use cases exist. For example, when evaluating Aerospike, someone at another ad-tech company mentioned his experience. We then talked to four or five other Aerospike customers and asked, “Is their case similar to what we are doing? What do they like about it? What don't they like about it?” Then we asked, “When I put this in my shop, what's going to happen that's going to surprise me?”

The other element to review is developer momentum. That is especially true with any open source project, but it also applies to a commercial product. Questions to ask include: “Are there developers who are writing about, using, and documenting this? Will this product have the kind of trajectory that you want to be on? Do I know someone who's using that system? And if it's an open source system, do I have a way to address my issues?”

For example, compare Apache Storm to Apache Spark right now. These are both usable as real-time computation-processing systems. Which one has more developer momentum? It's an important measure.

Next is driver fit. In other words: Will this thing fit into my system? For example, if we are using PHP or Python or C++, is this new software natively integrated with that language? Can we write tools that will actually access the APIs for this component?

Then, consider whether the product fails nicely. Particularly in our case, where we have several data centers distributed around the world, it’s important to know when things are failing. Some products don’t raise an alarm if they are breaking. Those are dangerous.

The last issue to consider is platform risk — committing a lot of resources to something that will cost the organization later. For instance, if a company is not an experienced .Net shop, does it make sense to adopt any technology related to the .Net platform? If the technology is Java based, do we have the resources to maintain the scaffolding around it? If we are committing to Amazon Web Services or Google Compute Engine, are we confident that these cloud platforms are moving in the direction that we want to be going in the next three years?

The technology a company implements may be viewed as either an advantage or a disadvantage to a potential customer, partner, or investor. Ultimately, the platform's goals have to match your business's goals.

John Krystynak is the founder and CTO of AppLovin. He's worked in engineering at NASA and SGI, and he helped found Internet advertising pioneer NetGravity in 1995.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

Related:
From CIO: 8 Free Online Courses to Grow Your Tech Skills
Notice to our Readers
We're now using social media to take your comments and feedback. Learn more about this here.