It could be called the “ignomoment”: the split second following a definitive action when you realize you’ve just made a tragic mistake. For network administrators, this means the difference between going home at 5 p.m. and going home at 5 a.m. The truth is, despite incidents of careless backhoe drivers pulling up fiber bundles or hurricanes bringing down the power lines, administrator error is the most common reason that a network fails.
“There’s always a reason why something doesn’t work,” says Richard Willmott, market manager at IBM Tivoli. “Finding that reason is the hard part.” Too often network administrators are up against a wall, lacking the budget, lab, and time necessary to fully determine the ramifications of modifications made to an internal routing protocol configuration, or to accurately determine the impact of a large-scale access list modification on live traffic. Although there is no way to eliminate human error, there are certainly ways to limit its effects.
A solid change-management process, along with proper training and sufficient IT resources, can turn that sinking feeling brought on by disparate systems and outdated tools into guarded confidence. Then there’s ITIL (IT Infrastructure Library), which is a collection of best practices for IT management. It describes in detail the steps necessary to institute various management practices to reduce problems and gain visibility into network infrastructures. Lastly, there are plenty of vendors whose products aim to streamline and automate the change-management process. Nothing is fail-safe, but that’s no excuse for not trying.
Software developers have the edge when it comes to testing and implementing changes. It’s all but unheard of to find developers writing and distributing code without any form of testing. A lab for a developer can be a laptop, and a full-scale software development lab infrastructure can be had for the cost of a few servers.
Yet, for the devices delivering the signals, changes of any scale are typically undertaken without the benefit of prior testing. Why? Because it’s nearly impossible to test every aspect of proposed network configuration changes thoroughly. Rather than simply requiring a few servers for a development environment, a network lab requires a wide variety of expensive network hardware to truly mimic the production environment. This means simulating TDM circuits and frame-relay networks, ISDN lines, and any other link types in use on the production network. Simple tests can be accomplished with a subset of the production gear, but the costs are high and confidence that the proposed change will function as expected can waver.
For many infrastructures, there are two paths available to deal with this problem. One is a lab environment that can simulate portions of the network; the other is strong change-management policies and change-management software to back up those policies. It’s one thing to inadvertently cause network disruptions; it’s quite another to realize that you have no backup of the functioning configuration and must replicate detailed parameters from human memory or outdated configurations.
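To make the backup point concrete, here is a minimal sketch of a versioned configuration archive. It is an illustration only: the ConfigArchive class and its method names are hypothetical, it keeps everything in memory, and it assumes the configuration text has already been retrieved from the device by some other means.

```python
import difflib
from datetime import datetime, timezone


class ConfigArchive:
    """Hypothetical in-memory archive of device configurations.

    A real archive would persist versions to disk or a database.
    """

    def __init__(self):
        # device name -> list of (timestamp, config text), oldest first
        self._versions = {}

    def backup(self, device, config_text, when=None):
        """Store a new version only if the config changed; return True if stored."""
        versions = self._versions.setdefault(device, [])
        if versions and versions[-1][1] == config_text:
            return False
        versions.append((when or datetime.now(timezone.utc), config_text))
        return True

    def last_known_good(self, device, steps_back=1):
        """Return the config stored `steps_back` versions before the latest one."""
        return self._versions[device][-1 - steps_back][1]

    def diff_latest(self, device):
        """Return a unified diff between the two most recent versions."""
        versions = self._versions[device]
        if len(versions) < 2:
            return []
        old, new = versions[-2][1], versions[-1][1]
        return list(difflib.unified_diff(
            old.splitlines(), new.splitlines(),
            fromfile="previous", tofile="current", lineterm=""))
```

With something like this in place, the admin who breaks a device at 4:55 p.m. can pull the previous version and see exactly which lines changed, rather than rebuilding the configuration from memory.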
Commercial products are available to help and, fortunately for buyers, this space is hotly contested. Several vendors offer product suites that claim to assist in maintaining policies across disparate network devices and performing automated configuration backup, searching, and restoration. AlterPoint’s Device Authority Suite offers a complete network development environment patterned on the Eclipse IDE that features extensive automatic scripting tools to develop configuration changes and push them to selected network devices.
Such tools form the core of any change-management initiative. Without proper methods to develop, deploy, maintain, and verify configuration policies across dozens or hundreds of devices, all the procedures in the world will not make a difference. Enterprising carriers and datacenter operators have even integrated help desk software and change window identification to speed the process of linking problems to recent changes and to assist in network troubleshooting. If the framework for thorough change management is available, capitalizing on integrations such as this will definitely help admins sleep at night.
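Linking a problem to recent changes can be as simple as a time-window query against the change log. The sketch below assumes a hypothetical ChangeRecord structure and a 24-hour lookback; neither reflects any particular vendor’s product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeRecord:
    """One logged change (hypothetical fields, for illustration)."""
    device: str
    author: str
    applied_at: datetime
    summary: str


def recent_changes(changes, incident_start, lookback=timedelta(hours=24)):
    """Return changes applied within `lookback` before the incident, newest first."""
    window_start = incident_start - lookback
    hits = [c for c in changes if window_start <= c.applied_at <= incident_start]
    return sorted(hits, key=lambda c: c.applied_at, reverse=True)
```

A help desk ticket that opens with the output of such a query gives the troubleshooter a short list of suspects before the first login.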
Now This Won’t Hurt a Bit
For many network administrators, initiating network change management is like a trip to the dentist: necessary but dreadful. Generally, network configuration changes are slight (an addition to an access list or a change to an SNMP community, for example) and require only seconds to implement. Navigating through onerous change-management guidelines can seem to complicate otherwise straightforward tasks. Abiding by the guidelines pays off, however. If you don’t summon the courage to see the dentist, problems only get worse; it’s a similar situation with configuring networks.
Enterprises can learn much about managing network configurations from service providers, whose very livelihood depends on handling changes smoothly, quickly, and accurately. Nearly all large ISPs have rigorous change-management procedures in place and back those up with thorough configuration management tools. Some ISPs aren’t as dutiful, and it can show.
Recently, I assisted during a network outage of a multistate MPLS (multiprotocol label switching) network. Although I was not privy to the carrier’s network, I was on the call with the NOC (network operations center) administrator looking into the problem. Due to the high-level nature of private MPLS networks, carriers have a much greater impact on the performance and reliability of the service. Where a traditional frame-relay network functions at layer 2, MPLS networks function at layer 3, and the carrier is responsible for maintaining valid routes across all POPs (points of presence). Thus, when a network failure occurs and all network links are active, the problem may lie within the carrier’s routing tables. Such was the case here. The problem was eventually traced to a change made in a router thousands of miles from the farthest point of this network, where routes were erroneously injected into the routing tables for my client’s MPLS network. The failure was triggered by a seemingly innocuous change to a routing table with no relation to the unintentionally affected network and, until someone contacted the tech who made the change, no one knew it had been made. It took three hours to identify and fix the problem. If an effective change-management policy had been in place, the problem could probably have been averted.
Following the ITIL Framework
Increasingly, effective change-management policies follow the ITIL framework.
In terms of change management, ITIL lays out a foundation based on a CMDB (configuration management database). The CMDB can take any form, from a simple Microsoft Access database to a fully fleshed-out SQL-driven solution. A CMDB can also be sourced from a vendor, such as Troux Technologies’ Troux CMDB for ITIL product. There are also a few hosted CMDB services that are best suited to small businesses, such as myCMDB.com.
The CMDB contains documentation on all the moving parts of any infrastructure and provides a framework for the modification of this data when changes are instituted.
The process of creating and maintaining a CMDB starts quite simply: define and document the critical infrastructures within your network. This can be a slippery slope and anyone with a stake in a particular application will claim that it is highly critical, so some discretion is required. Good examples of critical infrastructures would be security systems, payroll databases, shipping and inventory systems, and interfaces to external partners. These systems need to be documented from the ground up, with all the data residing in the CMDB.
One of the big concerns with the CMDB approach is determining a suitable level of detail, which is why identifying critical infrastructures is so important. It’s certainly feasible to document every aspect of the network in excruciating detail, but that may be counterproductive. Limiting the extent of data within the CMDB can prevent drowning under a sea of meaningless data while searching for the necessary elements that affect an existing problem. For instance, the network-management tools described above can participate in the CMDB, but it may not be necessary for device configurations to find their way to the CMDB. External resources may best handle such intricate low-level detail, with the CMDB providing detail on the overall function and purpose of that infrastructure.
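One way to picture that level of detail is a record that describes what an infrastructure is for and which devices it depends on, while only pointing to where the low-level configuration lives. The sketch below is an assumption-laden illustration: the class names, fields, and the config_ref pointer are all hypothetical, not an ITIL-mandated schema.

```python
from dataclasses import dataclass, field


@dataclass
class ConfigurationItem:
    """A device or component; the config itself stays in an external archive."""
    name: str
    role: str
    config_ref: str  # hypothetical pointer into a separate configuration archive


@dataclass
class Infrastructure:
    """A documented infrastructure and the items it depends on."""
    name: str
    purpose: str
    critical: bool
    items: list = field(default_factory=list)


def impacted_by(infrastructures, device_name):
    """Return names of critical infrastructures that depend on the given device."""
    return [
        infra.name
        for infra in infrastructures
        if infra.critical and any(item.name == device_name for item in infra.items)
    ]
```

Before touching a device, a quick impacted_by lookup answers the question that matters most: which critical systems could this change take down?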
Most configuration management vendors support the concept of change management. Although there is a distinction between the two, they are complementary. One aspect of configuration management that can clearly assist the change-management effort is policy management and adherence verification. Rendition Networks’ TrueControl, for instance, can generate a report verifying that every Cisco 2950 switch on the network has an identical access list in place, and has TCP small servers disabled. Further, policies can be created and configuration changes pushed to like devices simply, reducing the chance of human error affecting network resources. Of course, this also introduces the potential for the amplification of a single mistake across the network.
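The idea behind adherence verification can be sketched in a few lines. This is not how TrueControl or any other product works internally; it is a simplified illustration that assumes configurations are available as plain text, that the access list uses numbered "access-list N ..." lines, and that a policy is expressed as a reference list of entries plus a set of lines that must not appear.

```python
def acl_entries(config_text, acl_number):
    """Extract the entries of one numbered access list, in order."""
    prefix = f"access-list {acl_number} "
    return [
        line.strip()
        for line in config_text.splitlines()
        if line.strip().startswith(prefix)
    ]


def audit(configs, reference_acl, acl_number, forbidden_lines=()):
    """Check each device against the policy.

    configs: dict mapping device name -> configuration text.
    Returns a dict mapping device name -> list of findings (empty if compliant).
    """
    report = {}
    for device, text in sorted(configs.items()):
        findings = []
        if acl_entries(text, acl_number) != reference_acl:
            findings.append(f"access-list {acl_number} differs from reference")
        present = {line.strip() for line in text.splitlines()}
        for line in forbidden_lines:
            if line in present:
                findings.append(f"forbidden line present: {line}")
        report[device] = findings
    return report
```

Running a read-only audit like this before pushing a policy to like devices is also a cheap guard against the amplification problem: it shows which devices already deviate and by how much.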
Baby Steps First
Of course, best-practice guidelines can only help if the infrastructure is stable; there’s no sense in erecting scaffolding on a burning building. Instituting a short freeze on any new projects is a good way to achieve stability. This tactic will undoubtedly cause some short-term problems, but the long-term ROI is well worth it. Once the staff is no longer tasked with rush implementations of new systems, stability will increase. Then begin the change-management process by identifying the critical infrastructures, developing the database, and investigating the tools that can automatically update the database. Most IT shops keep track of various network elements in small ways, such as an Excel spreadsheet of VLAN locations or external IP numbering assignments. Migrating this data to a central repository is a simple first step. As with any systemic change, starting small and gaining early victories will pave the way for success when the project becomes more challenging.
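As a rough illustration of that first step, the sketch below loads a VLAN spreadsheet, exported as CSV, into a SQLite table. The column names (vlan_id, name, location) and the table layout are assumptions made for the example; a real spreadsheet will need its own mapping.

```python
import csv
import io
import sqlite3


def import_vlans(conn, csv_file):
    """Load VLAN rows from a CSV file object into the central database.

    Expects a header row with vlan_id, name, and location columns (assumed).
    Re-importing the same VLAN ID replaces the earlier row.
    Returns the number of rows read.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS vlans ("
        "vlan_id INTEGER PRIMARY KEY, name TEXT NOT NULL, location TEXT NOT NULL)"
    )
    rows = [
        (int(r["vlan_id"]), r["name"].strip(), r["location"].strip())
        for r in csv.DictReader(csv_file)
    ]
    with conn:  # commit on success, roll back on error
        conn.executemany(
            "INSERT OR REPLACE INTO vlans (vlan_id, name, location) VALUES (?, ?, ?)",
            rows,
        )
    return len(rows)
```

In practice, csv_file would be an open file exported from the spreadsheet; an io.StringIO object works just as well for trying the function out.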
After a CMDB is in place, several tools can assist in maintaining it. Help desk software that interfaces with a CMDB can provide valuable information to help desk engineers and simplify the movement of information but may be too complex to implement immediately. Ensuring that the chosen software is accurately populating and retrieving data from the CMDB is far more valuable than a quick implementation. The CMDB is only worthwhile if it is accurate. Thus, the age-old IT mantra of “plan, plan, plan, and implement” is the best course of action when introducing any form of change management. Small steps get the best results.
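Accuracy can be spot-checked by reconciling what the CMDB says against what is actually found on the network. The following sketch assumes both sides can be reduced to dictionaries keyed by device name; the function name and the report keys are hypothetical.

```python
def reconcile(cmdb, discovered):
    """Compare CMDB records with discovered inventory.

    Both arguments map device name -> dict of attributes.
    Returns devices missing from the CMDB, CMDB entries not found on the
    network, and attributes whose recorded and discovered values differ.
    """
    missing = sorted(set(discovered) - set(cmdb))
    stale = sorted(set(cmdb) - set(discovered))
    mismatched = {}
    for name in sorted(set(cmdb) & set(discovered)):
        diffs = {
            key: (cmdb[name].get(key), value)  # (recorded, discovered)
            for key, value in discovered[name].items()
            if cmdb[name].get(key) != value
        }
        if diffs:
            mismatched[name] = diffs
    return {
        "missing_from_cmdb": missing,
        "not_found_on_network": stale,
        "mismatched": mismatched,
    }
```

An empty report after each integration milestone is a reasonable signal that the software is populating the CMDB correctly before more tools are wired into it.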
Uniting minds over the concept of change management is an important precursor to a change management implementation. Until everyone believes in the benefits, there’s little incentive for them to use the system. If the policies are given a positive spin and those in positions of network responsibility see positive results, change management will catch on quickly.