Microservice tracing: Root cause performance issues in clusters and distributed platforms

As we move to distributed clusters and microservice-based cloud applications, there are new challenges to overcome.

Microservices are a popular architectural approach for cloud-native applications. But the idea of deconstructing a large service into smaller components was originally conceived for clusters and distributed platforms, when applications needed more compute performance, storage and network capacity than a single host could provide.

Once the boundary of a server was crossed, an application’s software components had to interact via inter-server “east-west” communications. As this concept developed and was applied to modern-day cloud services, building blocks such as JSON, RESTful APIs and Thrift were added to create what we now know as microservices.

Big data analytics and the Internet of Things (IoT) are two of the most prominent examples of applications that exploit microservices, giving rise to what can be referred to as “east-west” application workloads. In sharp contrast with traditional “north-south” workloads, east-west workloads are horizontally distributed, processed where the data resides, and executed by moving compute to data (as opposed to the other way around). Software-defined storage and networks help provide cluster-wide resource accessibility wherever compute is carried out.

As we have moved to distributed clusters and gained the benefits of microservice-based applications, we are finding new challenges. How do you secure or authenticate all of the application’s components? How do you use them within containers? How do you track down workload performance and resource efficiency issues across distributed servers and clustered resources?

The most difficult technical challenge: microservice tracing

Of these, microservice tracing is probably the most difficult technical challenge. With a distributed workload, what matters most is end-to-end throughput and latency through the entire system, and how efficiently and fairly resources are shared among different applications. In other words, while per-node optimization and insight are still important, optimization across microservices often has a bigger impact on end-to-end performance.

Let’s use an e-commerce site as an example. A clickstream analytics platform is deployed to instantly analyze a consumer’s navigation and provide real-time pop-ups for merchandise predicted to be interesting based on the consumer’s past purchases. These pop-ups could include product categories, brands and prices. In this case, web services are connected to clickstream analytics, which is in turn connected to a back-end database. A sudden rise in the number of website visitors could overwhelm the analytics platform, while many long and busy clickstream sessions could overwhelm the database. Understanding the behavior and performance profiles between microservices is the only way to improve end-to-end performance.

When investigating performance issues, three key factors enable us to deterministically gain insight into a distributed workload.

First, microservice-enabled workloads by nature consist of a number of microservices, each executing certain tasks and then passing its results on to the next. Given this, the workload of any one microservice depends heavily on the output of the previous one, so we need the capability to do end-to-end, workflow-based performance data collection, monitoring and correlation.
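To make the idea of workflow-based correlation concrete, here is a minimal Python sketch. It assumes each microservice emits a timing record tagged with a shared transaction identifier; the `Span` record, its field names and the sample service names are hypothetical illustrations, not part of any specific tracing product.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    """One microservice's share of a transaction (hypothetical record layout)."""
    trace_id: str    # assumed ID shared by every hop of one transaction
    service: str
    start_ms: float
    end_ms: float


def correlate(spans):
    """Group spans by trace_id and order each group by start time."""
    traces = defaultdict(list)
    for span in spans:
        traces[span.trace_id].append(span)
    for hops in traces.values():
        hops.sort(key=lambda s: s.start_ms)
    return dict(traces)


def end_to_end_latency_ms(hops):
    """Latency from the earliest start to the latest end within one transaction."""
    return max(s.end_ms for s in hops) - min(s.start_ms for s in hops)


# Records arrive out of order and interleaved across transactions.
spans = [
    Span("t1", "clickstream-analytics", 5.0, 40.0),
    Span("t1", "web", 0.0, 5.0),
    Span("t1", "database", 40.0, 55.0),
    Span("t2", "web", 10.0, 12.0),
]
traces = correlate(spans)
```

Grouping by a shared identifier is what lets per-service measurements be read as a single workflow rather than as unrelated node-level statistics.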

Second, the performance and resource efficiency of each microservice unit can be a significant contributor to inter-microservices throughput and efficiency. This suggests a performance management paradigm that is divided into two tiers, with the top tier handling overall end-to-end efficiency and the bottom measuring granular resource and performance efficiency.
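A two-tier view can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the per-hop fields (`latency_ms`, `cpu_util`), the sample numbers and the `cpu_hot` threshold are all hypothetical, and a real deployment would track more resources than CPU.

```python
def two_tier_report(hops, cpu_hot=0.8):
    """Summarize one transaction at two levels.

    Top tier: end-to-end latency across all hops.
    Bottom tier: per-service detail (slowest hop, services with high CPU use).
    The cpu_hot threshold is an assumed value for illustration.
    """
    total_ms = sum(h["latency_ms"] for h in hops)
    slowest = max(hops, key=lambda h: h["latency_ms"])
    hot = [h["service"] for h in hops if h["cpu_util"] >= cpu_hot]
    return {
        "end_to_end_ms": total_ms,
        "slowest_service": slowest["service"],
        "cpu_hot_services": hot,
    }


# Hypothetical measurements for one transaction through three microservices.
hops = [
    {"service": "web", "latency_ms": 5.0, "cpu_util": 0.35},
    {"service": "clickstream-analytics", "latency_ms": 35.0, "cpu_util": 0.92},
    {"service": "database", "latency_ms": 15.0, "cpu_util": 0.50},
]
report = two_tier_report(hops)
```

The top-tier number shows whether the transaction as a whole is slow; the bottom-tier fields point at the unit most likely responsible.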

Finally, in analyzing the root cause of what might be affecting the speed of an application distributed across a cluster, understanding the specific microservice path traversed for a particular transaction becomes necessary. This is where we need microservice tracing.

So, what is microservice tracing? It is the ability to trace and examine the “in operation” performance and resource status of every microservice executed by a particular distributed transaction. It could include the ability to closely examine both ends of how two microservices are interconnected, which could be on the same host or across different virtual machines, or across two different servers. For instance, a heavily loaded front-end microservice with a lightly loaded network link could be an indication that the front end is CPU bound. But a congested network link with dropped packets from the front-end microservice could mean the back end is network bound.
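The CPU-bound versus network-bound reasoning above can be written as a simple rule of thumb. The Python sketch below is a hedged illustration only: the function name, the utilization inputs (fractions between 0 and 1) and the `busy` and `idle` thresholds are assumptions chosen for the example, not values taken from any monitoring standard.

```python
def classify_hop(cpu_util, link_util, dropped_packets, busy=0.8, idle=0.3):
    """Rough diagnosis for one microservice and its outbound network link.

    busy and idle are assumed utilization thresholds for illustration.
    """
    # Congested link with packet loss: the network is the likely limit.
    if link_util >= busy and dropped_packets > 0:
        return "network-bound"
    # Busy service feeding a quiet link: the CPU is the likely limit.
    if cpu_util >= busy and link_util <= idle:
        return "cpu-bound"
    return "inconclusive"


front_end_busy = classify_hop(cpu_util=0.95, link_util=0.10, dropped_packets=0)
link_congested = classify_hop(cpu_util=0.40, link_util=0.90, dropped_packets=120)
```

A real tracing system would apply this kind of check to both ends of every connection along the path a transaction actually took, rather than to a single hop in isolation.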

To trace, analyze and diagnose LANs, we developed RMON and later RMON2. Now, as we move to clustered resources and componentized applications for distributed cloud services, we need the same ability to trace, analyze and diagnose microservices.

This article is published as part of the IDG Contributor Network.