As powerful as modern infrastructure technology has become, there's no denying it has also grown more complex and interdependent. These new technologies have made life in IT easier and more efficient, but they've also created a new class of difficult-to-sort-out failures -- some of which can sit dormant for months or years before they're detected.
In the past, a typical enterprise data center might have consisted of many servers, some top-of-rack and end-of-row network switching gear, and a few large storage arrays. The dependencies in that sort of environment are clear: The servers rely on the availability of the network and the storage they're addressing, while the network and storage (and its associated network) don't depend on much beyond themselves.
Today, the picture is quite different. There are still servers, of course, but they might be blades in a blade chassis that includes a built-in converged network fabric enabling connectivity both to the LAN and to storage. The storage then attaches directly to that fabric. Beyond that, some critical functionality of the converged network might be implemented in software running on the server blades. More complex still, if IP-based storage is used, simple access to that storage might depend on everything else working.
It's all too easy to allow a circular dependency to be built into such a system without realizing it. If you're particularly unlucky, you'll find out that you have that flaw only after a lot of other things have gone wrong. The only way to truly avoid such circular dependencies is to spend a lot of time reading documentation, charting interdependencies, and -- above all else -- testing.
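To give a sense of what charting interdependencies can look like in practice, here's a minimal sketch in Python: It records each "X depends on Y" relationship in a plain dictionary and walks the graph looking for a loop. The component names are hypothetical, and the script isn't tied to any particular product -- it's the kind of back-of-the-napkin check worth running against a design before it goes into production.

```python
# Minimal sketch: record "X depends on Y" relationships and walk the
# graph depth-first looking for a loop. Component names are hypothetical.

def find_cycle(deps):
    """Return one dependency cycle as a list of names, or None if there isn't one."""
    WHITE, GRAY, BLACK = 0, 1, 2        # unvisited, being visited, finished
    color = {}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for dep in deps.get(node, ()):
            state = color.get(dep, WHITE)
            if state == GRAY:           # back edge: we've walked into a loop
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in list(deps):
        if color.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# "X depends on Y" becomes deps["X"] = [..., "Y", ...]
deps = {
    "app server": ["LAN switch", "storage array"],
    "storage array": ["SAN fabric"],
    "LAN switch": [],
    "SAN fabric": [],
}
print(find_cycle(deps))  # None -- the traditional layout has no loops
```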
A real-world example
Although I've seen a wide array of problems of this sort, the best example I can think of involved Cisco's Nexus 1000V virtual switch in an EMC VMware vSphere environment. Right off the bat, I'll say that I'm a huge fan of software-defined networking, and although the Nexus 1000V isn't perfect or the only answer, it's a great product I've used many times. However, it's also quite different from deploying a physical switch, and it comes with a long list of external and internal dependencies.
In this example, the vSphere hosts had been configured with two copper 1Gbps NICs for front-end management traffic and two traditional (non-nPAR/CNA) 10Gbps NICs for virtual machine network access and access to the enterprise's NFS-based storage.
For those of you not familiar with it, the Nexus 1000V has two basic components: the Virtual Supervisor Module (VSM) and a collection of Virtual Ethernet Modules (VEMs). The VSM plays the role of the supervisor module in a modular switch, and the VEMs play the role of line cards. The control and management planes are implemented in the VSM, while nearly all of the data plane is switched by the VEMs.
From a practical perspective, the VSM is implemented as a virtual machine appliance (with an optional secondary appliance for high-availability purposes) that runs on the hosts. The VEMs are software modules that install in the vSphere hypervisor on each host. Of course, communication between the VSM and the VEMs is critically important because the VEMs don't really know what to do or how they should be configured without the VSM to tell them. Clearly there's a dependency. There's also a strong dependency between the VSMs and VMware vCenter, which coordinates the activities of the vSphere hosts.
Without communication between the VSM and VEMs, the VEMs won't know how to switch traffic. And without communication between the VSM and vCenter, no virtual machine networking configuration changes can take place (triggered from either side). That's much more complicated than having a couple of external physical switches, but it seems manageable.
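To make that chain concrete, here's a rough model of the dependencies just described, using Python's standard graphlib module. The labels are simplified stand-ins rather than anything from Cisco's or VMware's tooling, but the point holds: As long as the graph has no loops, a valid bring-up order exists.

```python
# A rough model of the dependencies described above, written as
# "X works only if Y is available." With only these edges the graph is
# acyclic, so a workable bring-up order exists.
from graphlib import TopologicalSorter  # Python 3.9+

needs = {
    "VM traffic switching":   ["VEMs"],            # guest traffic is switched by the VEMs
    "VEMs":                   ["VSM"],             # VEMs get their configuration from the VSM
    "network config changes": ["VSM", "vCenter"],  # changes require both ends of that link
    "VSM":                    [],
    "vCenter":                [],
}

print(list(TopologicalSorter(needs).static_order()))
# one valid order, e.g.: VSM, vCenter, VEMs, VM traffic switching, network config changes
```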
In this deployment, I got a few crucial things wrong -- and no one noticed until it was too late. That moment turned out to be a facility-wide power outage on a holiday. Not long after power was restored, it became evident that things were not working properly. It took the next eight hours to figure out why and to fix the problem.
What it eventually boiled down to was two critical oversights in tracking and planning for dependencies. The first was that the Nexus 1000V was tasked with operating the 10Gbps NICs in the vSphere servers -- the same NICs that provided access to the storage where the virtual machines lived. In what I can only imagine was a moment of distraction, I imported the Nexus 1000V VSMs onto SAN storage when building the infrastructure and totally forgot to move them onto local storage later.
What this meant was that the VSMs couldn't start: They sat on storage that couldn't be accessed unless the VEMs were active, which in turn couldn't happen because the VSMs weren't up. Until that loop was broken, the VSMs couldn't be brought online -- and until they were up, no other VMs could come up, either.
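Modeled the same way, the oversight jumps out: Adding the storage relationship closes a loop, and no bring-up order exists at all -- which is exactly what the powered-down environment was telling us. (Again, the labels below are simplified assumptions, not product terminology.)

```python
# The same style of model with the overlooked edge included: the VSM
# appliances lived on SAN storage that was reachable only through the
# 10Gbps NICs the VEMs controlled. No valid startup order exists anymore.
from graphlib import TopologicalSorter, CycleError

needs = {
    "VEMs":          ["VSM"],            # VEMs still need the VSM for their configuration
    "VSM":           ["SAN datastore"],  # the oversight: VSMs imported onto SAN storage
    "SAN datastore": ["VEMs"],           # that storage is reachable only via the VEM-managed NICs
}

try:
    list(TopologicalSorter(needs).static_order())
except CycleError as err:
    print("circular dependency:", err.args[1])
    # e.g. ['VEMs', 'SAN datastore', 'VSM', 'VEMs'] -- nothing can start first
```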
Once that problem was overcome (requiring some kludgey use of the front-end NICs to access storage), the trouble wasn't over. Although the vCenter VM had been placed on a basic vSwitch on the 1Gbps NICs (and so didn't depend on the Nexus 1000V), the Oracle database server it relied on ran in a VM that, in turn, depended on the 1000V. Worse, that VM couldn't simply be moved to the 1Gbps NICs because it also hosted production databases that required access to the 10Gbps NICs. Although the VM was temporarily reconfigured to get things working, the database eventually had to be migrated to a different VM.
The two key lessons, regardless of the specific technology involved
This incident taught me and everyone else involved some hard lessons about how not to configure the Nexus 1000V in a production environment. (If you're curious, there are many ways to avoid all of these problems by running the 1000V and vCenter components outside the environment they manage.) However, anyone can draw larger lessons from this -- whether or not you ever touch the 1000V.
The first key lesson is that being methodical and carefully examining the configuration you've deployed before you press it into production is critically important. It's very easy to say, "Oh, yeah, I'll fix that later," but with the breakneck pace of life in IT these days, will you actually do it? What I've done since that incident is keep a running list of the shortcuts I know I'm taking when I work on a project, then cross them off as I fix them. Otherwise, it's far too easy to leave a rotten Easter egg behind without realizing it.
The second key lesson is that testing is critically important. In this case, that would have meant a full-scale power-down and power-up of the infrastructure before it got too far into production (after a certain point, you won't be allowed to shut the whole thing down for a test!). Sure, that seemed like a waste of time during the deployment phase, but to everyone who spent most of a holiday at work troubleshooting a problem the power-down test would have caught, it doesn't look like wasted time in retrospect.
It boils down to paying attention. That's arguably no more important today than it has ever been in IT, but the consequences of not paying attention are substantially magnified by today's more complex and more interdependent infrastructure. The chances that a simple mistake will disable an entire infrastructure rather than a single component are far greater than ever before. As data center infrastructure continues to blur and converge, that problem will only get worse.
This article, "The modern data center's hidden risks: 2 key lessons," originally appeared at InfoWorld.com. Read more of Matt Prigge's Information Overload blog and follow the latest developments in storage at InfoWorld.com. For the latest business technology news, follow InfoWorld.com on Twitter.