Jul 19, 2017 3:00 AM

How to monitor MongoDB database performance

You can keep your MongoDB environment running smoothly by keeping a close eye on six key metrics


MongoDB is a favorite database for developers. As a NoSQL database option, it provides developers with a database environment that has flexible schema design, automated failover, and a developer-familiar input language, namely JSON.

There are many different types of NoSQL databases. Key-value stores store and retrieve each item using its name (also known as a key). Wide column stores are a kind of key-value store that uses columns and rows (much like a relational database), only the names of the columns and rows in a table can vary. Graph databases use graph structures to store networks of data. Document-oriented databases store data as documents, providing more structural flexibility than other databases.

MongoDB is a document-oriented database. It is a cross-platform database that holds data in documents in a binary-encoded JSON format (known as binary JSON, or BSON). The binary format adds data types beyond those of plain JSON and makes documents faster to scan and traverse.

MongoDB’s replication mechanisms help deliver high availability, and its sharding mechanism allows for horizontal scalability. Many top Internet companies such as Facebook and eBay use MongoDB in their database environment.

Why monitor MongoDB?

Your MongoDB database environment can be simple or complicated, local or distributed, on-premises or in the cloud. If you want to ensure a performant and available database, you should be tracking and monitoring analytics in order to:

  • Determine the current state of the database
  • Review performance data to identify any abnormal behavior
  • Provide some diagnostic data to resolve identified problems
  • Fix small issues before they grow into larger issues
  • Keep your environment up and running smoothly
  • Ensure ongoing availability and success

Monitoring your database environment in a measurable and regular way ensures that you can spot any discrepancies, odd behavior, or issues before they impact performance. Proper monitoring means you can quickly spot slowdowns, resource limitations, or other aberrant behavior and act to fix these issues before being hit with the consequences of slow websites and applications, unavailable data, or frustrated customers.

What should we monitor?

There are many things you can monitor in a MongoDB environment, but a few key areas will tip you off quickly if something is amiss. You should be analyzing the following metrics:

  • Replication lag. Replication lag refers to delays in copying data from the primary node to a secondary node.
  • Replica state. The replica state is a method of tracking if secondary nodes have died, and if there was an election of a new primary node.
  • Locking state. The locking state shows what data locks are set, and the length of time they have been in place.
  • Disk utilization. Disk utilization refers to how much disk space is available and how heavily the disks are being accessed.
  • Memory usage. Memory usage refers to how much memory is being used, and how it is being used.
  • Number of connections. The number of connections is how many connections the database has open in order to serve requests as quickly as possible.

Let’s delve into some of the details.

Replication lag

MongoDB uses replication to meet availability challenges and goals. Replication is the propagation of data from a primary node to multiple secondary nodes, as operations on the primary node change the data. These nodes can be co-located, in different geographic locations, or virtual.

All things being equal, data replication should happen quickly and without issues. Many things can happen that stop the replication process from executing smoothly. Even under the best conditions, the physical properties of the network limit how quickly data gets replicated. The delay between starting replication and completing it is referred to as replication lag.

In a smoothly running set of primary and secondary nodes (referred to as a “replica set”), the secondaries quickly copy changes on the primary, replicating each group of operations from the oplog as fast as they occur (or as close as possible). The goal is to keep replication lag close to zero. Data reads from any node should be consistent. If the elected primary node goes down or becomes otherwise unavailable, a secondary can take over the primary role without impacting the accuracy of data to clients. The replicated data should be consistent with the primary data before the primary went down.

Replication lag is a measure of how far the secondary nodes have fallen behind the primary. If a secondary node is elected primary, and replication lag is high, then the secondary’s version of the data can be out of date. A state of elevated replication lag can happen for several non-permanent or undefined reasons and correct itself. However, if replication lag stays high or starts increasing at a regular rate, this is a sign of a systemic or environmental problem. In either case, the bigger the replication lag – and the longer it remains high – the more your data is at risk of being out of date for clients.

There is only one way to analyze this metric: monitor it! This is a metric that should be monitored 24x7x365, so it is best done using automation and alert triggers to notify DBAs or on-call administrators as soon as it crosses an undesirable threshold. The right threshold depends on your application’s tolerance for replication delay. To determine the proper threshold, use a tool that graphs delay over time, such as Compass, MongoBooster, Studio 3T, or Percona Monitoring and Management (PMM).
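As a sketch of that kind of automated check, the following Python snippet computes lag and compares it to a threshold. It assumes you already collect each member’s latest oplog timestamp (for example, via rs.printSlaveReplicationInfo() or a monitoring agent); the 30-second threshold is purely illustrative.

```python
from datetime import datetime, timezone

# Hypothetical threshold; tune it to your application's
# tolerance for stale reads on the secondaries.
LAG_ALERT_SECONDS = 30

def replication_lag_seconds(primary_optime: datetime,
                            secondary_optime: datetime) -> float:
    """Lag is the gap between the last operation applied on the
    primary and the last operation applied on the secondary."""
    return (primary_optime - secondary_optime).total_seconds()

def should_alert(lag_seconds: float,
                 threshold: float = LAG_ALERT_SECONDS) -> bool:
    return lag_seconds > threshold

# Example: the secondary is 40 seconds behind the primary.
primary = datetime(2017, 7, 19, 3, 0, 45, tzinfo=timezone.utc)
secondary = datetime(2017, 7, 19, 3, 0, 5, tzinfo=timezone.utc)
lag = replication_lag_seconds(primary, secondary)
print(lag, should_alert(lag))  # 40.0 True
```

In practice, a monitoring agent would run this comparison on every polling interval and page someone only when the lag stays above the threshold for several consecutive samples.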

Replica state

Replication is handled via replica sets. A replica set is a set of nodes with an elected primary node and several secondary nodes. The primary node is the keeper of the most up-to-date data, and that data is replicated to the secondaries as changes are made to the primary.

Normally, one member of a replica set is primary and all of the other members are secondaries. The assigned status rarely changes. If it does, we want to know about it (usually immediately). The role change usually happens quickly, and usually seamlessly, but it is important to understand exactly why the node status changed, as it could have been due to a hardware or network failure. Changing between the primary and secondary states (also known as flapping) is not a normal occurrence, and in a perfect world should only happen due to known reasons (for example, during environmental maintenance like upgrading software or hardware, or during a specific incident such as a network outage).
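One simple way to catch flapping is to record which member holds the primary role at each polling interval and count role changes over a window. This is a minimal sketch on hypothetical monitoring data, not a specific MongoDB API:

```python
def primary_changes(observations):
    """Count how many times the PRIMARY role moved between members.

    `observations` is a time-ordered list of the member name that
    held the PRIMARY state at each polling interval.
    """
    changes = 0
    for prev, curr in zip(observations, observations[1:]):
        if prev != curr:
            changes += 1
    return changes

# One failover, then the original primary is re-elected: two changes.
history = ["node-a", "node-a", "node-b", "node-b", "node-a"]
print(primary_changes(history))  # 2
```

A count above zero outside a known maintenance window is worth an alert; a count climbing rapidly suggests flapping.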

Locking state

Databases are highly concurrent and volatile environments, with multiple clients making requests and initiating transactions that get performed on the data. These requests and transactions don’t happen sequentially or in a rational order. Conflicts can occur – for example, if transactions try to update the same record or document, if a read request comes during an update to data, etc. The way many databases deal with making sure data is accessed in an organized way is “locking.” Locking occurs when a transaction prevents a database record, document, row, table, etc., from being altered or read until the current transaction is done being processed.

In MongoDB, locking is performed at the collection or document level to prevent conflicts between concurrent transactions. Certain operations can also require a global database lock (for example, when dropping a collection). If locking occurs too often, it impacts performance by making transactions (including reads) wait for locked parts of the database to become available to read or modify. A high locking percentage is a sign of other issues in the database: hardware failure, bad schema design, badly configured indexes, not using indexes, etc.

It is important to monitor the locking percentage. You should know what an acceptable percentage is in regard to performance, and how long the percentage can be maintained before affecting performance. If performance degrades too much due to a high locking percentage, it can trigger a replica state change through server unresponsiveness.
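The locking percentage itself is just time spent locked divided by elapsed time over a sampling window. The sketch below illustrates the arithmetic; the counter names and the alert threshold are illustrative (your monitoring tool would derive the inputs from serverStatus-style lock counters), not an exact MongoDB API:

```python
def lock_percentage(time_locked_micros: int, elapsed_micros: int) -> float:
    """Fraction of the sampling window spent holding locks, as a
    percentage. Inputs are totals over one sampling window."""
    if elapsed_micros == 0:
        return 0.0
    return 100.0 * time_locked_micros / elapsed_micros

LOCK_ALERT_PERCENT = 15.0  # hypothetical tolerance

# 2.5 seconds locked out of a 10-second window.
pct = lock_percentage(time_locked_micros=2_500_000,
                      elapsed_micros=10_000_000)
print(pct, pct > LOCK_ALERT_PERCENT)  # 25.0 True
```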

Disk utilization

Every DBA should monitor the available disk space on their database servers. Once a database uses up the disk space on the host, that server comes to an abrupt stop. Proactively sizing your data and monitoring log file growth are great techniques for capacity planning.

Often your database might need to grow automatically. In these cases, you need to guarantee that it doesn’t outgrow the hardware. Periodically reviewing disk space can help prevent unexpected database server stops, as well as locate poor design issues (like queries requiring a full collection scan).
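A basic free-space check needs nothing beyond the standard library. This sketch uses Python’s shutil.disk_usage; the path and the 20 percent floor are illustrative (you would point it at the filesystem holding your dbPath and pick a floor that matches your growth rate):

```python
import shutil

def disk_free_percent(path: str) -> float:
    """Percentage of the filesystem at `path` that is still free."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total

# Alert well before the data filesystem fills up; 20% free is a
# hypothetical floor -- choose one that matches your growth rate.
FREE_SPACE_FLOOR_PERCENT = 20.0

free = disk_free_percent("/")
if free < FREE_SPACE_FLOOR_PERCENT:
    print(f"WARNING: only {free:.1f}% disk free")
```

Run on a schedule (cron or a monitoring agent), this gives you days or weeks of warning instead of an abrupt server stop.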

Memory usage

Keeping all of your data in RAM speeds up database response times. But what does that mean, and how do you know when something is in RAM?

The way that your database uses memory can be somewhat unclear. A great deal of the memory a server uses is for the buffer pool (data). It can be difficult to find out which database uses the largest portion of the buffer pool memory, and even more difficult to find out which collections or documents are actually in the buffer pool memory. Knowing this information is useful when load balancing your database across multiple servers (via sharding), or identifying data that is optimal for consolidation into one server instance.

Using tools to determine which instances are using memory the most, and for what data, can help you to optimize your environment.
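As a starting point, MongoDB reports its own memory footprint in the mem section of the serverStatus command output. The sketch below summarizes a document shaped like that output (values are in megabytes; the sample document here is abbreviated and illustrative):

```python
def memory_summary(server_status: dict) -> str:
    """Summarize the `mem` section of a serverStatus-style document.
    Values are reported in megabytes; `mapped` appears only for
    memory-mapped storage engines."""
    mem = server_status["mem"]
    parts = [f"resident={mem['resident']}MB",
             f"virtual={mem['virtual']}MB"]
    if "mapped" in mem:
        parts.append(f"mapped={mem['mapped']}MB")
    return " ".join(parts)

# Sample document shaped like db.serverStatus() output.
status = {"mem": {"bits": 64, "resident": 512, "virtual": 2048}}
print(memory_summary(status))  # resident=512MB virtual=2048MB
```

Graphing resident memory over time against the host’s physical RAM tells you when the working set has outgrown the server.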

Number of connections

Database transactions are usually initiated by applications and processes through “connections.” The number of open connections can impact the performance of the database. In theory, once a transaction is complete, the connection should be terminated. In practice, however, many of the connections get left open. It is normal for a database to keep some connections alive to facilitate certain transactions, but if too many are left open it can limit the number available in the connection pool.

As a best practice, a database should keep connections open for the least amount of time necessary to complete a request. This allows a small pool of connections to service a massive number of transaction requests. Otherwise, application transaction requests will be stuck waiting for an open connection. You need to monitor the number of open connections in the database to verify that they are being closed, and that there is a healthy number of connections left in the pool for incoming requests.
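The serverStatus command also exposes a connections section with the number of open connections (current) and the number still available (available). A sketch of a pool-utilization check on a document of that shape (the 80 percent ceiling is a hypothetical value):

```python
def connection_utilization(connections: dict) -> float:
    """Fraction of the connection pool in use, as a percentage,
    from a serverStatus-style `connections` section."""
    current = connections["current"]
    available = connections["available"]
    total = current + available
    return 100.0 * current / total if total else 0.0

CONN_ALERT_PERCENT = 80.0  # hypothetical ceiling

# Sample shaped like db.serverStatus().connections.
sample = {"current": 410, "available": 590, "totalCreated": 12345}
pct = connection_utilization(sample)
print(f"{pct:.0f}% of connections in use")  # 41% of connections in use
```

A utilization figure that climbs steadily without corresponding traffic growth usually points to connections being leaked rather than closed.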

Tools provided with MongoDB

Now that we know what we should monitor, the next question is how? Fortunately, MongoDB comes with some easy-to-use tools for monitoring server statistics.

mongostat

This utility provides global statistics on memory usage, replica set status, and more, updated every second (by default).

The mongostat utility gives you an overview of your MongoDB server instance. If you are running a single “mongod” instance, it shows you the statistics for that single instance. If you are running a MongoDB cluster environment, then it returns the statistics for the “mongos” instance. mongostat is best used for watching a single instance for a specific event (for example, what happens when a specific application request comes in). You can use this command to monitor basic server statistics:

  • CPU
  • Memory
  • Disk IO
  • Network traffic

See the MongoDB documentation on mongostat for specifics on usage.

mongotop

This utility provides collection-level statistics on read and write activity.

The mongotop command tracks the time required to complete read and write operations on a MongoDB server instance. It provides statistics on a per-collection level. mongotop returns values every second by default, but you can adjust the time frame as needed.

All the per-second metrics are relative to your server’s configuration, as well as the cluster architecture. For single instances run locally, and using the default port, all you need to do is enter the mongotop command. If you are running in a clustered environment with multiple mongod and mongos instances, you will need to provide a hostname and port number with the command.

See the MongoDB documentation on mongotop for specifics on usage.

rs.status()

This command provides the status of the replica set.

You can use the rs.status() command to get information about a running replica set. This command can be run from the console of any member of the set, and it will return the status of the replica set as seen by that member.
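The output includes a members array with each member’s state (for example, PRIMARY, SECONDARY, RECOVERING). A monitoring script can flag anything outside the healthy states; this sketch operates on an abbreviated sample document shaped like rs.status() output:

```python
def unhealthy_members(rs_status: dict):
    """Return the names of replica set members that are neither
    PRIMARY nor SECONDARY, from the `members` array of an
    rs.status()-style document."""
    healthy = {"PRIMARY", "SECONDARY"}
    return [m["name"] for m in rs_status["members"]
            if m.get("stateStr") not in healthy]

sample = {
    "set": "rs0",
    "members": [
        {"name": "db1:27017", "stateStr": "PRIMARY"},
        {"name": "db2:27017", "stateStr": "SECONDARY"},
        {"name": "db3:27017", "stateStr": "RECOVERING"},
    ],
}
print(unhealthy_members(sample))  # ['db3:27017']
```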

See the MongoDB documentation on rs.status() for specifics on usage.

sh.status()

This command provides the status of a sharded cluster.

The sh.status() command displays a report of the shard configuration when executed on a mongos instance. It also provides information on the chunks in a sharded cluster. By default, it omits the per-chunk details when the cluster has 20 or more chunks.

See the MongoDB documentation on sh.status() for specifics on usage.

Eyes on MongoDB

Monitoring your MongoDB database is an important part of maintaining the health and performance of your database environment. By monitoring and gathering analytics in areas such as replication lag, replica state, locking state, connections, and disk and memory utilization, you can help ensure that the database stays up and running. Through constant vigilance you can detect issues before they become catastrophic.

You’re only as good as the tools you have to use. Percona Monitoring and Management (PMM) is an open-source platform for managing and monitoring MongoDB developed by Percona on top of open-source technology. Behind the scenes, the various graphing features use Prometheus (a popular time-series data store), Grafana (a popular visualization tool), mongodb_exporter (our MongoDB database metric exporter), and other technologies to provide database and operating system metric graphs for your database instances.

With PMM, you can graphically monitor replication lag, replica state, locking, disk and memory utilization, and connections for the WiredTiger, MMAPv1, and MongoRocks storage engines. Check out a demo here.

A technology expert with more than 20 years of expertise working with databases and technical training, Rick Golba is a solutions engineer at Percona. He specializes in helping customers understand their database issues and finding solutions to resolve them. Prior to Percona, he worked as a technical trainer for HP/Vertica.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.