The trouble with network monitoring is that the more you know, the more you find you need to do. Some shops will make like an ostrich and do the bare minimum so that they can plead ignorance, while other shops use only what the vendor tells them they need to use. I wanted a solution that neither tied me to a single manufacturer nor hid its head in the sand. At the same time, I demanded a tool that wouldn't blithely send alert blasts I then had to sort through, but would put those alerts into context.
It turns out the solution I was looking for is ScienceLogic's EM7. It's a carrier-class monitoring system capable of scaling to truly staggering proportions, but still appropriate (and affordable enough) for my smaller network at the University of Hawaii School of Ocean and Earth Science and Technology's (SOEST) Research Computing Facility.
Although it has taken a while for me and the IT support group to get comfortable with the ScienceLogic EM7 system, its ability to create a multitenant environment will soon allow us to hand off monitoring duties to the individual labs that run the HPC clusters instead of forcing us to add more personnel. Plus, EM7 lets us easily build monitoring support for equipment that lacks MIBs (management information bases). And like many distributed organizations, we preferred to avoid purchasing a complete setup for every location, yet still wanted the ability to monitor remote sites through firewalls and over WAN links. To this end we placed EM7 collectors in key locations using either physical appliances or virtual machines. The collected data is forwarded to the main database en masse instead of clogging up our WAN pipes with constant SNMP traffic.
Built for big networks
One point I want to make very clear is that EM7 should not be compared to WhatsUp Gold or other network monitoring systems designed for the single enterprise. This is a carrier-class system that was born from engineers working with national and international carriers that needed enough flexibility to handle hundreds of entities and a similar number of connected networks. The fact that EM7 can use national weather service map overlays to put potential trouble spots into perspective gives you a good idea of the scale the ScienceLogic folks are used to. Even so, keep in mind that I've been using EM7 for the past year or so in a single college on a single campus at a single university. Pricing is a function of the number of systems you're monitoring.
In other words, the EM7 pricing structure doesn't differentiate between a flat enterprise network with 1,000 devices to monitor and a network like ours that includes dozens of labs with projects behind their own firewalls. All of this sits on a collection of NAT Class C IPv4 subnets -- some with public addressing and a smaller number making the transition to IPv6. If you have projects behind NAT firewalls, you can put in a virtual machine collector that feeds systems information to the main EM7 database for the same price as a flat network with a single collector. If you're not virtualized yet, the database and collectors are also available as physical appliances. We make use of both physical and virtual appliances.
Initial setup is quite easy, though EM7 has its own way of doing things. At first, I was confused over where to find features and the use of a whole new terminology. Ultimately, the logic of EM7's naming conventions seems to come back to the multitenant nature of the solution.
For instance, the "registry" handles devices, device groups, networks, users, and just about anything you might typically call "assets," but with a multitenant twist. The same goes for "run book" (I might have used "action items"), which is a collection of items the system will run for notifications, scripts, and cascading actions. The run book is where the automation lives, based on scripts contributed by ScienceLogic and the EM7 user community. My favorite is an automation script we employ in the InteropNet NOC that uses SNMP set commands to turn on a power socket that drives either red or blue flashing lights to indicate major or critical alarms.
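As a rough illustration of that kind of automation, here is a minimal Python sketch that builds a Net-SNMP `snmpset` command line for switching on a power outlet. It is not the actual InteropNet script, which isn't shown here. The outlet OID prefix, the outlet numbers, and the integer value meaning "on" are all placeholder assumptions; a real power strip publishes its own values in its vendor MIB.

```python
# Hypothetical mapping of alarm severity to a switched outlet number.
# Which outlet drives which light is an assumption for illustration only.
SEVERITY_TO_OUTLET = {"major": 1, "critical": 2}

# Assumed integer value that switches an outlet on; check the vendor MIB.
OUTLET_ON = "1"


def build_outlet_command(host, community, outlet_oid_prefix, severity):
    """Return the argv list for an snmpset call that turns on the alarm light.

    outlet_oid_prefix is a placeholder for the PDU's outlet-control OID;
    the outlet number is appended as the final sub-identifier.
    """
    outlet = SEVERITY_TO_OUTLET[severity]
    oid = f"{outlet_oid_prefix}.{outlet}"
    # snmpset syntax: snmpset [options] AGENT OID TYPE VALUE ("i" = INTEGER)
    return ["snmpset", "-v", "2c", "-c", community, host, oid, "i", OUTLET_ON]


if __name__ == "__main__":
    # 192.0.2.10 is a documentation address; the OID prefix is a placeholder.
    cmd = build_outlet_command("192.0.2.10", "private", "1.3.6.1.4.1.99999.1", "critical")
    print(" ".join(cmd))
```

The sketch only assembles the command. To actually send it, you would pass the list to `subprocess.run` on a host with the Net-SNMP tools installed and a write-enabled community string.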
The customizability of EM7 is a huge differentiator. After a year of learning the system, the team at Interop 2011 went to town with a wide variety of dashboards created both by the ScienceLogic team and by the InteropNet crew. One was designed to fit the massive 55-inch monitor in the NOC, and it allowed the network operations team to keep an eye on the status of various equipment groups from across the room. Because EM7 dashboards are customizable on an individual basis, we even created HTML5 dashboards so that the team could use an iPad to watch key components. (EM7 dashboards are currently a mix of HTML5 and Flash, but ScienceLogic is migrating more and more of the widgets to HTML5 and away from Flash.)
More signal, less noise
EM7 provides similarly fine-grained control over alerting. Setting up the InteropNet involves putting together a large number of pieces in a short amount of time. To speed the process, the EM7 crew would discover assets and put them into maintenance mode in order to ignore errors as we tested parts of the network during construction. At one point, we knew we had a bunch of optical errors as the fiber crew cleaned and connected more of the show infrastructure, but we didn't want to be overwhelmed by alerts and flashing lights. EM7 allowed us to change the alerting so that small numbers of minor errors would be ignored, but increasing numbers of minor errors over a short time would automatically escalate from minor to major to critical.
The happy result: We weren't notified of optical errors until the routers started load balancing over both our major WAN links. Minor errors were ignored until they started skyrocketing due to the increase in the amount of traffic on our secondary link. This escalated alert got our attention and forced us to look for missed optical path issues. It turned out someone had plugged in a dirty fiber cable that drove some grit into the optics of a 10G interface module. The escalated alert gave us enough time to replace the damaged components before the open of the exhibit floor.
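The escalation behavior described above can be sketched as a sliding-window counter. This is only an illustration of the idea, not how EM7 implements it; the class name, the window length, and the thresholds are all hypothetical.

```python
from collections import deque


class ErrorEscalator:
    """Escalate minor errors based on how many arrive within a time window.

    All names and default thresholds here are illustrative assumptions.
    """

    def __init__(self, window_seconds=60.0, major_threshold=10, critical_threshold=50):
        self.window_seconds = window_seconds
        self.major_threshold = major_threshold
        self.critical_threshold = critical_threshold
        self._timestamps = deque()

    def record(self, timestamp):
        """Record one minor error at `timestamp` (seconds) and return the severity."""
        self._timestamps.append(timestamp)
        # Discard errors that have aged out of the window.
        while timestamp - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        count = len(self._timestamps)
        if count >= self.critical_threshold:
            return "critical"
        if count >= self.major_threshold:
            return "major"
        return "minor"  # stays minor, so no notification is sent


if __name__ == "__main__":
    escalator = ErrorEscalator(window_seconds=60, major_threshold=3, critical_threshold=5)
    for t in (0, 100, 101, 102, 103, 104):
        print(t, escalator.record(t))
```

A trickle of errors spread over time never leaves "minor," while a burst inside the window climbs to "major" and then "critical," which is the pattern that flagged the dirty fiber once traffic ramped up.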
EM7 has so many features and capabilities, it's hard to do it justice. Setup is easy, and the initial learning curve isn't too painful. It scales like crazy and distributes the load across lots of collectors and database appliances, physical or virtual. You can add any number of collectors -- one for every subnet, if you want -- as part of the license. Best of all, EM7 can be extended through templates and scripting to do just about anything you could possibly want from a management system, even if you don't have a full MIB for the device. In nearly two years of running this system, I haven't found anything it can't manage.
We chose EM7 at the University of Hawaii SOEST because we wanted each research lab to be able to handle its own monitoring and trouble ticketing, instead of managing it at the college or campus level. But along with the multitenant management and billing capabilities, as well as the staggering granularity of delegation, ScienceLogic EM7 has everything needed to manage your entire enterprise from soup to nuts. Its ability to correlate events across a huge variety of platforms, provide context-sensitive views, automatically generate trouble tickets, and flexibly scale without breaking the bank is simply a game changer.
ScienceLogic EM7 at a glance
| Cost | Platforms |
| --- | --- |
| $15 per monitored device. Volume discounts available. | Deployed via dedicated appliances or VMware virtual appliances. Monitors any SNMP device I've found and does a great job with Windows-specific instrumentation too. |