Maximize the performance of your monitoring system

Do you have large numbers of devices to monitor? Cut your monitoring server some slack.

When your monitoring server is gathering SNMP and performance data from hundreds or thousands of devices across the network, it can use all the help it can get. Follow these tips for a scalable and snappy system. 

  • Disk reads and writes are usually the main bottleneck of a monitoring system. Buy fast disks with large caches.
  • Do not use RAID-5 on your monitoring system's disks. Every small write also forces a parity update, and that penalty is huge unless the number of disks is very large.
  • Put databases on separate sets of disk spindles so that their reads and writes do not interfere with each other. Put your MySQL/PostgreSQL database on one set of physical disks, and your RRDtool performance data on another. And if it works within your budget, put the OS and application files on yet another set of physical disks.
  • Also consider moving your MySQL/PostgreSQL and RRDtool databases to dedicated server hardware.
  • Do not use LVM on your monitoring system's disks. It puts an extra layer between the filesystem and the physical disks and can slow disk access.
  • Although monitoring systems are not terribly CPU intensive, you should load up your monitoring hardware with plenty of RAM -- as much as your budget will allow.
  • Collecting performance data over TCP-based protocols in a large environment can eat up resources on the server, and it is much slower than SNMP, which normally runs over UDP. Stick with SNMP for data collection wherever possible. You should still feel free to poll TCP services on the monitored servers for service up/down status.
  • Are you monitoring Windows machines? The native Microsoft SNMP agent doesn't give out enough information and has been known to have stability issues. WMI is slow compared to SNMP for data collection. Install SNMP Informant to get more stability and more useful information from your Windows infrastructure via SNMP.
  • Only collect data on devices and interfaces that you really care about. You probably don't care about loopback interface stats, so don't collect them. Do you have devices with multiple interfaces? If you don't need performance data on some of those interfaces, then don't collect it.
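
The last tip can be sketched in a few lines of Python. This is only an illustration, not the configuration syntax of any particular monitoring product: the Interface record, the interfaces_to_collect function, and the "wanted" parameter are hypothetical names made up for this example. The one borrowed fact is that the standard IF-MIB reports an ifType of 24 (softwareLoopback) for loopback interfaces.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set

# ifType value for loopback interfaces in the IANAifType-MIB.
SOFTWARE_LOOPBACK = 24


@dataclass
class Interface:
    """Hypothetical record for one interface discovered on a device."""
    name: str
    if_type: int  # value of IF-MIB ifType for this interface


def interfaces_to_collect(
    interfaces: Iterable[Interface],
    wanted: Optional[Set[str]] = None,
) -> List[Interface]:
    """Return only the interfaces worth collecting performance data on.

    Loopback interfaces are always skipped. If `wanted` is given, only
    interfaces whose names appear in it are kept; otherwise every
    non-loopback interface is kept.
    """
    selected = []
    for iface in interfaces:
        if iface.if_type == SOFTWARE_LOOPBACK:
            continue  # nobody needs loopback traffic graphs
        if wanted is not None and iface.name not in wanted:
            continue  # not on the list of interfaces we care about
        selected.append(iface)
    return selected


# Example: a device with a loopback and two Ethernet ports (ifType 6).
discovered = [
    Interface("lo", SOFTWARE_LOOPBACK),
    Interface("eth0", 6),
    Interface("eth1", 6),
]
uplinks_only = interfaces_to_collect(discovered, wanted={"eth0"})
```

Here uplinks_only holds just eth0, so the poller makes one set of SNMP requests and one set of disk writes per cycle for that device instead of three. Every interface you drop this way saves both network round trips and performance-data updates on disk.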