In the real estate world, the mantra is location, location, location. In the network and server administration world, the mantra is visibility, visibility, visibility. If you don't know what your network and servers are doing at every second of the day, you're flying blind. Sooner or later, you're going to meet with disaster.
Fortunately, there is a plethora of good tools, both commercial and open source, that can shine much-needed light into your environment. Because good and free always beats good and costly, I've compiled a list of my favorite open source tools that prove their worth day in and day out in networks of any size. From network and server monitoring to trending, graphing, and even switch and router configuration backups, these utilities will see you through.
First, there was MRTG. Back in the heady days of the 1990s, Tobi Oetiker saw fit to write a simple graphing tool built on a round-robin database scheme that was perfectly suited to displaying router throughput. MRTG begat RRDTool, which is the self-contained round-robin database and graphing solution in use in a staggering number of open source tools today. Cacti is the current standard-bearer of open source network graphing, and takes the original goals of MRTG to whole new levels.
Cacti is a LAMP/WAMP (Linux/Windows, Apache, MySQL, and Perl/PHP/Python) application that provides a complete graphing framework for data of nearly every sort. In some of my more advanced installations of Cacti, I'm collecting data on everything from fluid return temperatures in datacenter cooling units to free space on filer volumes to FLEXlm license utilization. If a device or service returns numeric data, it can probably be integrated into Cacti. There are templates to monitor a wide variety of devices, from Linux and Windows servers to Cisco routers and switches -- basically anything that speaks SNMP. There are also collections of contributed templates for an even greater array of hardware and software. I've written several data templates for Cacti that can be downloaded from the project site, including the FLEXlm monitoring code.
Cacti's default collection method is SNMP, but local Perl or PHP scripts can be used as well. The framework deftly separates data collection and graphing into discrete instances, so it's easy to rework and reorganize existing data into different displays. Not only that, but you can easily select specific timeframes and sections of graphs just by clicking and dragging. In some of my installations, I have data going back several years, which proves invaluable when determining if current behavior of a network device or server is truly anomalous or, in fact, occurs with some regularity.
Using the PHP Network Weathermap plug-in for Cacti, you can easily create live network maps showing link utilization between network devices, complete with graphs that appear when you hover over a depiction of a network link. In many places where I've implemented Cacti, these maps wind up running 24x7 on 42-inch LCD monitors mounted high on the wall, providing the whole IT staff with at-a-glance updates on network utilization and link status.
Cacti is extremely well written, well presented, and infinitely customizable. There really is no comparison to this tool in either the open source or commercial world.
Nagios is a surprisingly mature network monitoring framework that's been in active development for many years. Written in C, it's just about everything that system and network administrators could ask for in a monitoring package. The Web GUI is fast and intuitive (although it's even better with the contributed Nuvola style), and the back end is extremely robust.
As with Cacti, there is a very active community supporting Nagios, and plug-ins exist for a massive array of hardware and software. From basic ping tests to integration with plug-ins like WebInject, you can constantly monitor the status of servers, services, network links, and basically anything that speaks IP. I use Nagios to monitor server disk space, RAM and CPU utilization, FLEXlm license utilization, server exhaust temperatures, and WAN and Internet link latency. I even use it to ensure that Web servers are not only answering HTTP requests, but that they're returning the expected pages and haven't been hijacked.
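The "expected pages" check can be sketched in a few lines of Python. This is only an illustration, not the WebInject plug-in or any check shipped with Nagios: the function names `evaluate_page` and `check_url` are made up for this example, and the only Nagios convention relied on is the standard plug-in contract of one line of status text plus an exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import urllib.request

# Standard Nagios plug-in exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_page(status, body, expected_text):
    """Classify an HTTP response: (exit_code, one-line status message)."""
    if status != 200:
        return CRITICAL, f"HTTP CRITICAL - status {status}"
    if expected_text not in body:
        # The server answered, but not with the page we expect to see.
        return CRITICAL, "HTTP CRITICAL - expected text not found in page"
    return OK, f"HTTP OK - status {status}, expected text present"


def check_url(url, expected_text, timeout=10):
    """Fetch url (hypothetical target) and evaluate what came back."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return evaluate_page(resp.status, body, expected_text)
    except OSError as exc:
        # Covers connection failures, timeouts, and HTTP error statuses.
        return CRITICAL, f"HTTP CRITICAL - {exc}"
```

A thin wrapper would call `check_url`, print the message, and exit with the returned code so that Nagios can pick up the state.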
Network and server monitoring is obviously incomplete without notifications. Nagios has a full e-mail/SMS notification engine, and an escalation layout that can be used to make intelligent decisions on whom to notify and when, which can save plenty of sleep if used correctly. In addition, I've integrated Nagios notifications with Jabber, so the instant an exception is thrown I get an IM from Nagios detailing the problem. The Web GUI can be used to quickly suspend notifications or acknowledge problems when they occur, and can even record notes entered by admins.
As if this weren't enough, a mapping function displays all the monitored devices in a logical representation of their placement on the network, with color-coding to show problems as they occur.
The downside to Nagios is the configuration. The config is best done via command line and can present a significant learning curve. As with many tools, the capabilities of Nagios are immense, but the effort to take advantage of some of those capabilities is equally significant.
But don't let the complexity discourage you -- Nagios has saved my bacon more times than I can possibly recall. The value of the early-warning systems this tool provides for so many different aspects of the network cannot be overstated. It's easily worth the time investment. I've written several Nagios plug-ins, including one that monitors a wide variety of APC hardware, and they've paid me back many times over.
If you've ever had to search for a device on your network by telnetting into switches and doing MAC address lookups, or you just wish that you could tell where a certain device is physically located (or, perhaps more important, where it was located), then you should take a good look at NeDi.
NeDi is a LAMP application that regularly walks the MAC address and ARP tables on your network switches, cataloging every device it discovers in a local database.
You can then log into the NeDi Web GUI and conduct searches to determine the switch and switch port of any device by MAC address, IP address, or DNS name.
In addition, NeDi collects as much information as possible from every network device it encounters, pulling serial numbers, firmware and software versions, current temps, module configurations, and so forth. You can even use NeDi to flag MAC addresses of devices that are missing or stolen, and NeDi will watch to see if they appear on the network again.
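The idea behind the lookup and the stolen-device watch can be shown with a toy catalog in Python. This is a conceptual sketch only, not NeDi's code or schema: the function names, switch names, and table layout are all assumptions made for the example.

```python
def normalize_mac(mac):
    """Reduce a MAC address to 12 lowercase hex digits, whatever the notation."""
    digits = "".join(ch for ch in mac.lower() if ch in "0123456789abcdef")
    if len(digits) != 12:
        raise ValueError(f"not a MAC address: {mac!r}")
    return digits


def build_catalog(switch_tables):
    """Map each MAC to (switch, port) from per-switch MAC address tables.

    switch_tables is a hypothetical structure: {switch_name: [(mac, port), ...]}.
    """
    catalog = {}
    for switch, table in switch_tables.items():
        for mac, port in table:
            catalog[normalize_mac(mac)] = (switch, port)
    return catalog


def find_flagged(catalog, watchlist):
    """Return the watch-listed MACs that are currently in the catalog."""
    hits = {}
    for mac in watchlist:
        key = normalize_mac(mac)
        if key in catalog:
            hits[key] = catalog[key]
    return hits


# Made-up tables as they might be read from two switches.
tables = {
    "sw-core-1": [("00:1A:2B:3C:4D:5E", "Gi1/0/12")],
    "sw-edge-2": [("0050.56ab.cdef", "Fa0/3")],
}
catalog = build_catalog(tables)
print(find_flagged(catalog, ["00:50:56:AB:CD:EF"]))
```

A real discovery also has to tell access ports from uplinks, since the same MAC address shows up on every switch along the path; the sketch ignores that.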
Configuration is straightforward, with a single config file that allows for a significant amount of customization, including the ability to skip devices based on regular expressions or network-border definitions. You can even include seed lists of devices to query if the network is separated by nondiscoverable boundaries, as in the case of an MPLS network. NeDi usually uses Cisco Discovery Protocol or Link Layer Discovery Protocol, discovering new switches and routers as it rolls through the network, then connecting to them to collect their information. Once the initial configuration has been set, a discovery run is fairly quick and can be scheduled from cron at set intervals.
NeDi also integrates with Cacti to some degree: if you provide it with the credentials to a functional Cacti installation, discovered devices will link to their associated Cacti graphs.
Ntop is the product of a fantastically focused mind -- that of Luca Deri, the project's author. Ntop is a top-notch network traffic monitor married to a fast and simple Web GUI. It's written in C and completely self-contained; you run a single process configured to watch a specific network interface, and that's about all there is to it.
Ntop provides easily digestible graphs and tables showing current and past network traffic, including protocol, source, destination, and history of specific transactions as well as the hosts on either end. Ntop leverages the aforementioned RRDTool to provide an impressive array of network utilization graphs, including trends, and incorporates a plug-in framework for an array of add-ons, such as NetFlow and sFlow monitors.
Ntop even has an RPC framework that can be used to provide native data arrays to a wide variety of languages. If you wanted to consistently reference a specific set of packet capture data from Perl or PHP, for example, it's as simple as referencing a native array exported from Ntop at the time of the procedure call. I've found this infinitely useful in a wide variety of applications.
One of the main uses of Ntop is on-the-spot traffic checkups. When one of my Cacti-driven PHP Weathermaps suddenly shows a collection of network links running in the red, it tells me that those links exceed 85 percent utilization, but it doesn't tell me why. By switching to an Ntop process watching that network segment, I can quickly pull a minute-by-minute report of the top talkers and immediately know which hosts are responsible and what traffic they're pushing.
That kind of visibility is invaluable, and it's very easy to come by. Essentially, you can run Ntop on any interface that's been configured at the switch level to monitor another port or VLAN. That's really it.
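For readers unfamiliar with the term, a top-talkers report is essentially a sum of bytes per host, sorted in descending order. The Python sketch below shows that aggregation on made-up flow records; it does not use Ntop's interfaces, and the `(source, destination, bytes)` record layout is an assumption for illustration.

```python
from collections import Counter


def top_talkers(flows, n=3):
    """Sum bytes sent per source host and return the n heaviest senders."""
    totals = Counter()
    for src, dst, nbytes in flows:
        totals[src] += nbytes
    return totals.most_common(n)


# Hypothetical flow records: (source, destination, bytes).
flows = [
    ("10.0.0.5", "10.0.1.20", 900_000),
    ("10.0.0.7", "10.0.1.20", 150_000),
    ("10.0.0.5", "10.0.2.31", 400_000),
    ("10.0.0.9", "10.0.1.20", 20_000),
]
print(top_talkers(flows, n=2))
```

The same grouping by destination or by protocol answers the other half of the question, namely what traffic those hosts are pushing and to whom.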
Pancho is a simple Perl script that reaches out to Cisco routers and switches and pulls down a current copy of the running configuration. When run at set intervals, it allows admins to keep instant backups of router and switch configurations, which can be terribly valuable when things go pear-shaped and nobody thought to write down some specific configuration information for an edge router.
Pancho hasn't been under active development since 2005, but that hasn't been a problem so far. In fact, barring fundamental changes in Cisco IOS, Pancho's latest and last release may be completely functional for years to come.
There's not really much more to say about Pancho. It takes all of five minutes to configure and use, and as long as you properly secure the downloaded configurations, there's very little risk involved. In a nutshell, you risk more by not using Pancho.
The Snort IDS has been available as an open source tool for 10 years now. In fact, it was so successful that it developed into a viable commercial tool with support from Sourcefire, but the open source version is still actively developed and available.
In either the commercial or open source flavor, Snort is a very complete intrusion detection system that watches and catalogs network traffic, matching that traffic against predefined rules to monitor network segments for nefarious activity. In fact, it can do much more, since rules can be written to flag traffic that matches any criteria. If you want to check all IM traffic exiting the network that matches a specific internal product code name, that's certainly possible, right along with standard rules that watch for port scans, virus activity, and so forth.
When coupled with the BASE (Basic Analysis and Security Engine) Web GUI, Snort becomes an even more powerful tool. When Snort is configured to log to MySQL, BASE can pull reports on alarm triggers and display traffic anomalies based on source or destination IP address, TCP or UDP port number, and alert type. In addition, if you have multiple Snort sensors in various places on the network, they can all log to the same database, and BASE can produce reports incorporating any or all of those sensors.
The best part is that a Snort sensor doesn't have to be anything special. In most networks, it can easily be built on a low-end desktop- or server-class system, depending on traffic levels. The basic rule sets are available for free from Sourcefire with registration, and rules updates are easily managed. And if you want to go with a supported solution, you can buy the official commercial counterpart from Sourcefire. In either case, Snort can quickly become an invaluable addition to any network.
Too often, IT administrators think that they can't color outside the lines. Whether it's a custom application or an "unsupported" piece of hardware, there are many of us who believe that if a monitoring tool can't handle it immediately, it can't be handled. That's simply not the case, and with a little bit of elbow grease, just about anything can be monitored, cataloged, and made more visible.
An example might be a custom application with a database back end, like a Web store or an internal finance application. Management wants to see pretty graphs and charts depicting usage data in some form or another. If you're using something like Cacti already, there are several ways to bring this data into the fold, such as constructing a simple Perl or PHP script to run queries on the database and pass counts back to Cacti, or even an SNMP call to the database server using private MIBs (management information bases). It can be done, and it can generally be done easily.
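A script of that kind can be very small. The sketch below is written in Python rather than Perl or PHP and uses an in-memory SQLite database as a stand-in for the application's real back end. The `orders` table and the function names are invented for the example. The output format, space-separated `name:value` pairs on a single line, is the convention Cacti's script-based data input methods use for multiple fields.

```python
import sqlite3


def order_counts(conn):
    """Return per-status row counts from a hypothetical 'orders' table."""
    rows = conn.execute(
        "SELECT status, COUNT(*) FROM orders GROUP BY status ORDER BY status"
    ).fetchall()
    return dict(rows)


def cacti_line(fields):
    """Format values as space-separated name:value pairs on one line."""
    return " ".join(f"{name}:{value}" for name, value in fields.items())


# Stand-in data; a real script would connect to the application's database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO orders (status) VALUES (?)",
    [("paid",), ("paid",), ("pending",)],
)
print(cacti_line(order_counts(conn)))  # prints "paid:2 pending:1"
```

Each field name in the output is then mapped to a data source in Cacti, and the graphs follow from there.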
For unsupported hardware, as long as it speaks SNMP, you can most likely squeeze the data you need out of it with a little research. Once you have the right MIBs to query, you can then use that information to write a Nagios plug-in to monitor the device. An example might be my Nagios plug-ins for APC hardware -- they didn't exist when the hardware was installed, but I wanted to centralize the monitoring of those devices. I wrote a quick plug-in to check the PDUs (power distribution units) for amperage levels, the in-row cooling units for airflow and rack inlet temperatures, and so forth. Now, not only do I have that data in graphs via Cacti, but Nagios watches the same data, looking for anomalies and reporting to me via IM, e-mail, and even SMS if the numbers are out of whack.
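The threshold logic in such a plug-in is the easy part. The Python sketch below is not my APC plug-in; it assumes the amperage reading has already been fetched over SNMP by some other means, and the function name and thresholds are hypothetical. It follows the usual Nagios plug-in conventions: a one-line status message, performance data after the pipe character in `label=value;warn;crit` form, and an exit code of 0, 1, 2, or 3.

```python
# Standard Nagios plug-in exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_amps(amps, warn, crit):
    """Compare a PDU load reading (in amps) against warn/crit thresholds."""
    if amps is None:
        # No value came back from the device, so the state is unknown.
        return UNKNOWN, "PDU UNKNOWN - no reading returned"
    if amps >= crit:
        state = CRITICAL
    elif amps >= warn:
        state = WARNING
    else:
        state = OK
    # Text after the pipe is performance data: label=value;warn;crit
    message = (
        f"PDU {LABELS[state]} - load {amps:.1f}A"
        f" | load={amps:.1f};{warn};{crit}"
    )
    return state, message


state, message = check_amps(13.2, warn=12, crit=16)
print(message)
```

A wrapper script would print the message and call `sys.exit(state)`; the performance data is what lets graphing tools chart the same numbers that Nagios alerts on.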
Getting most of these tools running isn't much of a challenge. On a freshly built CentOS box, all you need to do is install the proper repository RPM from RPMForge, then run "yum install nagios ntop cacti" and Nagios, Ntop, and Cacti will download and install. Configuring the tools can take quite a while depending on the size of the infrastructure, but getting them going is a cinch. At the very least, it's worth a test-drive.
No matter what tools you use to keep tabs on your infrastructure, having them in place essentially provides the equivalent of at least one more IT admin -- one that can't necessarily fix anything, but one that watches everything, 24/7/365. The up-front time investment is well worth the effort, no matter which way you cut it. Just be sure to run a small set of autonomous monitoring tools on another server, watching the main monitoring server. This is a case where it's always best to ensure that the watcher is being watched.