The top 3 serverless problems and how to solve them

Follow these tips to eliminate noncompute bottlenecks, avoid provider throttling and queueing, and keep your serverless functions responsive

Serverless computing is all the rage. Everyone who is anyone is either investigating it or has already deployed on it. Don’t be last in line or you’ll be sure to miss out!

What’s all the fuss about? Serverless computing gives you an infrastructure that allows server resources to be applied to a system as necessary for it to scale, effectively providing compute horsepower as a utility to be consumed by the system as load demands.

This means that nobody needs to care about individual servers at runtime (frankly, nobody ever did care about them). Economies of scale make it cost-effective to outsource the management of a fleet of servers to a cloud provider, while the “serverless” interface keeps that outsourcing relationship as simple as possible by minimizing the contract.

Being human, many people immediately try to replace the charts, traffic lights, and alerts they attached to their servers with charts, traffic lights, and alerts for their individual serverless functions. Sadly, this does not fundamentally address the application management challenge: just as nobody really cared about servers, nobody really cares about serverless functions in isolation either.

What people do care about is the level of service the system is delivering to its users. This means that monitoring, to be valuable, must focus on things that might go wrong, and in the context of serverless, “go wrong” mostly means an attempt to violate the laws of physics, since “run out of server capacity” has effectively been taken off the table.

So what are the typical serverless problems, and how do they manifest? Here are three big issues that plague serverless deployments and ways to mitigate them.

Cold start cost

Problem: This is an often-discussed issue with serverless systems. To maximize utilization, serverless providers sometimes shut down inactive functions entirely. When load resumes, the function’s start-up cost adds to the response time. When one business function is composed of many serverless functions chained together, the effect can compound, because each function in the chain may incur its own cold start.

Mitigation: Many users artificially ping their functions to keep them alive. To apply this strategy effectively to a network of chained services, you must understand the end-to-end relationships between them, so that every service in a dependency chain is kept active. That makes end-to-end tracing of business transactions essential.
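As a rough illustration of why the dependency chain matters, here is a minimal Python sketch that works out which functions need pinging for a given entry point. The call graph, the function names, and the ping hook are all hypothetical; in a real system the graph would come from your transaction traces, and the ping would be whatever invocation mechanism your provider offers.

```python
from collections import deque

# Hypothetical call graph: each function maps to the functions it invokes.
# In practice this would be derived from end-to-end transaction traces.
CALL_GRAPH = {
    "checkout": ["price-cart", "reserve-stock"],
    "price-cart": ["lookup-tax"],
    "reserve-stock": [],
    "lookup-tax": [],
    "nightly-report": [],
}


def functions_to_keep_warm(entry_point, call_graph):
    """Return every function reachable from entry_point, breadth-first."""
    seen = [entry_point]
    queue = deque([entry_point])
    while queue:
        current = queue.popleft()
        for callee in call_graph.get(current, []):
            if callee not in seen:
                seen.append(callee)
                queue.append(callee)
    return seen


def warm(targets, ping):
    """Call ping(name) once per target; ping is an assumed invocation hook."""
    for name in targets:
        ping(name)
```

Calling functions_to_keep_warm("checkout", CALL_GRAPH) returns the whole checkout chain and leaves out nightly-report, which is not on that path. Pinging only the entry point would leave the downstream functions free to go cold.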

Throttling

Problem: Serverless platforms limit the number of concurrent requests that serverless functions will execute, often both at the account and the individual function level. Once the concurrency limit is hit, further user requests will be queued, causing extended response times. While it might seem counterintuitive to throttle our effectively unlimited pool of compute resources, this does prevent exposure to potentially unlimited bills (don’t forget, capacity is billed on a consumption basis).

Mitigation: Raise the thresholds! Or at least, make sure you set them wisely to meet whatever non-functional requirements you have in terms of response time and concurrent usage. Again, end-to-end visibility is required so you can tell precisely what was throttled, and what impact throttling had on the end-user experience.
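One way to set a threshold "wisely" is to start from Little's law, which says that the average number of requests in flight equals the arrival rate multiplied by the average time each request takes. The Python sketch below applies that rule of thumb. The function names and the headroom parameter are assumptions made for illustration, not part of any provider's API.

```python
import math


def required_concurrency(requests_per_second, avg_duration_seconds, headroom=0.25):
    """Estimate concurrent executions needed (Little's law) plus spare headroom."""
    steady_state = requests_per_second * avg_duration_seconds
    return math.ceil(steady_state * (1 + headroom))


def would_throttle(requests_per_second, avg_duration_seconds, concurrency_limit):
    """True if steady-state demand alone exceeds the configured limit."""
    return requests_per_second * avg_duration_seconds > concurrency_limit
```

For example, 200 requests per second at an average of 0.5 seconds each keeps about 100 executions in flight, so a limit of 50 would throttle under steady load. This is only an average; bursts and slow downstream calls push the real requirement higher, which is why the headroom factor is there.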

Noncompute bottlenecks

Problem: OK, so you removed all the serverless throttles, and now your functions will support as many concurrent requests as you can imagine. Problem solved! Sadly, it’s more a case of problem moved. Sooner or later, your functions will need to persist some state somewhere, and depending on where that is, you may have more trouble in store. At some point you will have to wait to read or write data, meaning your infinitely scaled lambdas will all be waiting for data access while you are billed for the unproductive waiting.

The exact reason your functions are hanging around like this will vary by persistent store. 

  • Cloud data storage: Cloud data services are becoming increasingly elastic, but many still require you to configure resources based on concurrent read and write volumes. 
  • Traditional systems: No function is an island, and many enterprise users of serverless are wrapping existing systems (sometimes mainframes, sometimes conventional server-based deployments) with serverless functions. While it’s easy to raise thresholds to allow functions to scale, that simply means it’s easy to overwhelm whatever back end you have that cannot scale so easily.
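To put a rough number on the unproductive waiting described above, the Python sketch below assumes a billing model based on memory multiplied by execution time (GB-seconds), which is a common serverless pricing scheme. The function name is hypothetical, and the price used in the example is a placeholder, not any provider's actual rate.

```python
def wasted_wait_cost(invocations, wait_seconds, memory_gb, price_per_gb_second):
    """Cost of time spent blocked on data access under GB-second billing."""
    return invocations * wait_seconds * memory_gb * price_per_gb_second


# Placeholder figures: one million invocations, each blocked for 2 seconds,
# running with 0.5 GB of memory at an assumed price per GB-second.
example_cost = wasted_wait_cost(1_000_000, 2, 0.5, 0.00002)
```

The point of the arithmetic is that the cost grows linearly with both the number of invocations and the time each one spends waiting, so a slow data store gets more expensive as the functions in front of it scale out.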

Mitigation: To make sure your back-end systems can handle the theoretical maximum load, tune them in conjunction with your function throttles. This will help you ensure the system works smoothly end-to-end, so you avoid unnecessary costs and customer dissatisfaction. In some cases, you may need to replicate data to enable different systems to access it from different places. (Of course, this comes at the cost of increased data management complexity, and the risk of inconsistencies creeping into the data.)
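Tuning the back end "in conjunction with" the throttles can be as simple as checking that the worst case the throttles allow fits within what the back end can serve. The following Python sketch assumes that each function invocation holds a fixed number of back-end connections and that the back end has a known connection limit. Both are simplifications, and every name and figure here is hypothetical.

```python
def max_safe_concurrency(backend_max_connections, connections_per_invocation=1,
                         reserved_for_other_clients=0):
    """Largest function concurrency the back end can serve at once."""
    if connections_per_invocation <= 0:
        raise ValueError("connections_per_invocation must be positive")
    available = backend_max_connections - reserved_for_other_clients
    return max(available // connections_per_invocation, 0)


def backend_overcommit(function_limits, backend_max_connections,
                       connections_per_invocation=1):
    """Connections the throttles could demand beyond back-end capacity (0 if none)."""
    worst_case = sum(function_limits.values()) * connections_per_invocation
    return max(worst_case - backend_max_connections, 0)
```

With throttles of 100 and 50 on two functions that share a back end limited to 120 connections, backend_overcommit reports 30 connections of potential overload. Either the throttles come down or the back end is scaled up.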

Again, understanding the end-to-end flows through your system at the transaction level is critical to identifying and alerting on bottlenecks in production, and also to analyzing the system from end to end to aid tuning.

Serverless ops is devops

I hear what you’re saying: “But I’m a developer! Why should I care about this crazy, nonfunctional deployment stuff?”

Here’s the biggest serverless implication of all: the configuration now is the code (or at least part of it). What developers deliver to production in the serverless world is the whole package, not just functional pieces.

This, in turn, means that where once you debugged production issues with your IDE, in the serverless world you had better familiarize yourself with some kind of performance management solution. At least half of your “bugs” could turn out to be deployment-related. It’s no longer “their fault” (i.e., Ops). The system’s fate is entirely in your hands—and end-to-end visibility at the application level will become critical to you.

Peter Holditch is senior principal product manager at AppDynamics

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

Copyright © 2019 IDG Communications, Inc.