Scaling telemetry monitoring with InfluxDB

How a team at Cisco tapped IOS-XR, a multi-processing collector agent, and InfluxDB to create a telemetry monitoring system capable of ingesting 3TB of telemetry data per day.

User expectations for software applications keep rising. Services are now expected to be highly reliable and to perform well around the clock. Any downtime frustrates users and hurts your business over the long term.

A key component in improving reliability is monitoring your application. While setting up basic monitoring is easy, scaling that monitoring efficiently as traffic to your service grows is a major challenge. You also want visibility into every important metric for your service, and you want the data you collect to be useful and actionable, which means being able to query and analyze it efficiently, in real time and on demand.

In short, there’s a big difference between the problems you run into when throwing together something for a side project or small-scale system and those you face when deploying telemetry monitoring at scale in a production environment.

One team at Cisco experimented with InfluxDB to create an example of a scalable telemetry monitoring architecture that other companies with large-scale production environments could draw on, without having to start from scratch. This setup allowed Cisco to scale up its telemetry data ingestion to 3TB per day (roughly 2GB per minute). At the core of this architecture are Cisco IOS-XR and InfluxDB.

Cisco telemetry monitoring architecture overview

There are three main components in Cisco’s telemetry architecture. The first is the Cisco hardware running IOS-XR, which produces the telemetry data. The second is the collector agent, which takes in that data and sends it on to the final component, InfluxDB, for storage.

[Figure: Cisco telemetry monitoring architecture. Image: InfluxData]

Cisco IOS-XR

IOS-XR is the operating system used by Cisco for its high-end, carrier-grade routers such as the CRS series, 12000 series, and ASR 9000 series network routers. Compared to other network operating systems, IOS-XR provides improved availability, better scalability for large hardware configurations, the ability to install upgrades or patches while the router remains in service, and numerous other features not offered by competing systems.

One particularly relevant feature is that IOS-XR provides integrated streaming of telemetry data to increase network visibility and has APIs available for engineers to take action based on telemetry data.

For this architecture, Cisco streamed data from three different IOS-XR platforms: the NCS 5500, the ASR 9000, and the 8000 series router. Cisco configured the devices to run in dial-out mode, sending self-describing GPBs (Google Protocol Buffers) over a TCP connection. A key consideration at this stage of a telemetry monitoring architecture is making sure it doesn’t collect more data than it needs, both in the number of metrics gathered and in how often each metric is collected.
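To see why both the number of metrics and the collection frequency matter, a rough back-of-the-envelope estimate helps. The sketch below is illustrative only: the function name, the device and sensor-path counts, the intervals, and the bytes-per-sample figure are all assumed values, not numbers from Cisco's experiment.

```python
SECONDS_PER_DAY = 86_400


def estimate_daily_bytes(devices, sensor_paths, interval_s, bytes_per_sample):
    """Rough daily telemetry volume for a fleet of streaming devices.

    All inputs are assumptions supplied by the caller; real message sizes
    depend on the encoding and on what each sensor path reports.
    """
    samples_per_day = SECONDS_PER_DAY / interval_s
    return devices * sensor_paths * samples_per_day * bytes_per_sample


# Hypothetical fleet: 100 routers, 50 sensor paths each, 400 bytes per sample.
baseline = estimate_daily_bytes(100, 50, interval_s=10, bytes_per_sample=400)
relaxed = estimate_daily_bytes(100, 50, interval_s=30, bytes_per_sample=400)

print(f"{baseline / 1e9:.2f} GB/day at 10 s, {relaxed / 1e9:.2f} GB/day at 30 s")
```

Under these assumptions, volume scales linearly with the number of sensor paths and inversely with the interval, so trimming unused metrics or sampling less often shrinks the load on the collectors and the database proportionally.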

Collector agent

The telemetry data from the IOS-XR hardware was sent to a load balancer, which then distributed the data across three different collector agents. At large scale, single-threaded collector systems cannot handle the amount of data being sent to them. Multi-threaded collectors have issues of their own, because every thread uploads to the database over a separate connection, which creates another set of problems.

To get around these problems, Cisco wrote a multi-processing collector agent and published the code as open source on GitHub. The collector agent’s main process is decoupled from the worker pool, which parses the data and uploads it to InfluxDB. The main process adds data to a queue as it is streamed in and then sends the telemetry data to the worker pool in batches. Thanks to this decoupled architecture, the collector agent can handle gigabytes of data per second while remaining reliable. This can be seen in the diagram below.

[Figure: Collector agent architecture, with the main process feeding a worker pool. Image: InfluxData]
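The decoupled design described above can be sketched in a few lines of Python. This is not Cisco's collector code, which the article does not show. It is a minimal illustration of the pattern, and every name in it (`make_batches`, `parse_batch`, `upload`, `run_collector`), the message fields, and the batch size are assumptions. The `upload` function is a stub standing in for a real database write, and the pool's internal task queue plays the role of the main process's queue.

```python
import multiprocessing as mp
from itertools import islice

BATCH_SIZE = 500  # assumed value; tune for your message sizes


def make_batches(messages, size=BATCH_SIZE):
    """Main-process step: group the incoming stream into lists of up to `size`."""
    it = iter(messages)
    while batch := list(islice(it, size)):
        yield batch


def parse_batch(batch):
    """Worker step: turn decoded messages into line-protocol-style rows."""
    return [
        f'{msg["path"]},device={msg["device"]} value={msg["value"]} {msg["ts"]}'
        for msg in batch
    ]


def upload(rows):
    """Stub for the database write; returns the number of rows 'written'."""
    return len(rows)


def handle_batch(batch):
    """Everything a worker does for one batch: parse, then upload."""
    return upload(parse_batch(batch))


def run_collector(messages, workers=3):
    """Batch the stream in the main process and fan batches out to a pool."""
    with mp.Pool(processes=workers) as pool:
        return sum(pool.imap_unordered(handle_batch, make_batches(messages)))
```

Because the pool starts separate processes, call `run_collector(...)` from under an `if __name__ == "__main__":` guard in a script. The main process does only cheap work (reading and batching), while the CPU-heavy parsing and the uploads happen in the workers, so a slow write in one worker does not stall intake.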

InfluxDB

The final piece of the telemetry architecture is InfluxDB, which is used to store the data. For this experiment, InfluxDB was deployed as a cluster of two data nodes and three meta nodes for improved reliability and performance.

InfluxDB is a purpose-built time series database designed to handle massive volumes of time-stamped data, which made it a strong fit for Cisco’s telemetry monitoring use case. InfluxDB also suits any workload that needs to write large amounts of data and query that data in real time. Common use cases include IoT, analytics, and application monitoring.

InfluxDB is open source and can be deployed on your own infrastructure or set up in minutes using InfluxData’s cloud offering, InfluxDB Cloud. InfluxDB Cloud is a fully managed, elastic time series data platform that allows users to get started quickly and then easily scale to meet their requirements. Ingested data can be displayed using InfluxDB Cloud’s built-in dashboards, and data can be queried using Flux, InfluxData’s composable, functional query language designed for time series workloads.
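To give a sense of what a Flux query over telemetry data might look like, the sketch below builds one as a string in Python. The helper name `build_flux_query`, the bucket name, and the measurement name are hypothetical, and the query shape (a windowed mean over a recent time range) is just one common pattern, not something taken from Cisco's setup.

```python
def build_flux_query(bucket, measurement, window="1m", lookback="5m"):
    """Return a Flux query that averages one measurement over fixed windows.

    Inputs are interpolated directly, so pass only trusted values here.
    """
    return (
        f'from(bucket: "{bucket}")\n'
        f'  |> range(start: -{lookback})\n'
        f'  |> filter(fn: (r) => r._measurement == "{measurement}")\n'
        f'  |> aggregateWindow(every: {window}, fn: mean)'
    )


# Hypothetical bucket and measurement names, for illustration only.
print(build_flux_query("telemetry", "interface_counters"))
```

The resulting string can be run in the InfluxDB Cloud query editor or passed to a client library, for example the `query_api().query(...)` call in InfluxData's `influxdb_client` Python package.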

For its use case, Cisco made a few changes to InfluxDB’s standard configuration to optimize it for the team’s specific needs. The first was adjusting the default cache (buffer) memory size. Because the collector agent was writing data in batches, InfluxDB needed a larger amount of memory set aside to hold that data while it was being written. At the cluster level, Cisco also chose to allow out-of-order replica writes between nodes. This allowed more flexibility in the relationship between the order in which data arrived and the timestamps attached to the points.

Scaling telemetry data is a difficult task that many companies have tried to solve on their own. Cisco’s goal in this experiment was to provide a blueprint architecture for other companies to follow so that they don’t have to reinvent the wheel for their own use case. A core part of Cisco’s solution was InfluxDB because of its performance, ease of use, and open source code base.

Sam Dillard is senior product manager of IoT and enterprise at InfluxData.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

Copyright © 2021 IDG Communications, Inc.
