Recently, I was faced with an interesting virtual networking problem: how to allow a large virtualization environment to be failed over to a recovery data center and thoroughly tested, without impacting the production network. As is often the case, my quest for an elegant solution took me down the rabbit hole and led to consideration of several new virtualization networking technologies. In the end, I didn't achieve exactly what I set out to, but I did catch a glimpse of what the future might hold.
In this case, a large VMware vSphere-based virtualization infrastructure had been configured to fail over to a duplicate secondary data center elsewhere on the same campus, the entirety automated through the use of VMware's Site Recovery Manager (SRM).
Since both data centers had visibility to the same mutually redundant core route switches and associated layer two (L2) networks, an actual failover would be relatively seamless from a networking perspective, involving none of the dynamic routing or, worse, re-addressing that might be required in a typical geographic site failover. If the production data center were to become unavailable for whatever reason, SRM could simply bring the virtual machines back up at the secondary data center in virtual networks configured identically to those in the primary data center, and life would continue without any real changes to the network.
Solution No. 1: VLANs and a VRF
Simply creating a duplicate set of L2 VLANs wouldn't allow virtual machines in one VLAN to speak to those in other VLANs. For that, I would need layer three (L3) routing. Fortunately, the existing core switches supported the definition of VRFs (Virtual Routing and Forwarding instances). By configuring a new VRF on the core switches, I could partition a subset of the core's L3 VLAN interfaces into a "partitioned" router with its own route table, thus allowing identically addressed interfaces to exist for the production and test environments without allowing them to see or conflict with each other.
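The key property of a VRF is that each instance keeps its own route table, so the same prefix can live in the production and test environments simultaneously. A minimal Python sketch of that idea (the VRF names, prefixes, and interface names here are purely illustrative, not taken from any real device configuration):

```python
import ipaddress

# Two VRFs, each with its own route table. The identical 10.1.0.0/24
# prefix exists in both without conflict, exactly because lookups
# never cross VRF boundaries. (Hypothetical names and routes.)
route_tables = {
    "production": {ipaddress.ip_network("10.1.0.0/24"): "Vlan101"},
    "srm-test":   {ipaddress.ip_network("10.1.0.0/24"): "Vlan901"},
}

def lookup(vrf: str, dst: str) -> str:
    """Longest-prefix match confined to a single VRF's table."""
    addr = ipaddress.ip_address(dst)
    matches = [net for net in route_tables[vrf] if addr in net]
    best = max(matches, key=lambda n: n.prefixlen)
    return route_tables[vrf][best]

# The same destination address resolves differently per VRF:
print(lookup("production", "10.1.0.25"))  # Vlan101
print(lookup("srm-test", "10.1.0.25"))    # Vlan901
```

The same separation is what lets the recovery environment reuse production addressing without the two ever seeing each other.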
While that solution would undoubtedly work, it would add a substantial amount of complexity to the network, including the definition of new VLANs in the secondary data center and the core, as well as a substantial amount of new routing configuration within the core. There had to be a better way.
Solution No. 2: VXLANs and a virtual router
Not all that long ago, Cisco, VMware, and several other partners spearheaded the creation of a new network overlay protocol standard called VXLAN (Virtual eXtensible Local Area Network). VXLAN seeks to fix many of the scalability and provisioning problems that large cloud infrastructures experience -- namely, the need to constantly provision new VLANs on both physical and virtual networking gear to support isolated customer networks, and the exhaustion of available VLAN IDs (a 12-bit field, allowing a maximum of 4,096 VLANs).
VXLAN addresses those challenges by tunneling L2 frames between virtualization hosts inside L3 packets -- in a sense, creating a set of completely isolated network segments that are visible only within the virtualization environment and require little to no configuration outside of it. VXLAN also uses a much larger 24-bit field for its so-called Virtual Network Identifier (VNI), allowing roughly 16 million VXLAN segments within a shared infrastructure. Though I wasn't concerned with creating millions of networks, using VXLAN in place of defining new VLANs would nearly eliminate configuration changes to the physical network devices and would be easy to extend down the line -- exactly what I wanted.
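The difference in scale comes straight from the header layout defined in RFC 7348: an 8-byte VXLAN header (flags, reserved bytes, and a 24-bit VNI) prepended to the original L2 frame, with the whole thing carried in a UDP/IP packet between hosts. A sketch of just the header construction, to make the 12-bit-versus-24-bit point concrete:

```python
import struct

VXLAN_PORT = 4789  # IANA-assigned UDP port for VXLAN

def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header per RFC 7348:
    1 flags byte (0x08 = valid-VNI bit set), 3 reserved bytes,
    a 24-bit VNI, and 1 trailing reserved byte."""
    assert 0 <= vni < 2**24, "VNI is a 24-bit field"
    # Pack the VNI into the high 24 bits of the final 32-bit word.
    return struct.pack("!B3xI", 0x08, vni << 8)

# 12-bit VLAN IDs vs. 24-bit VNIs:
print(2**12, 2**24)  # 4096 16777216

hdr = vxlan_header(5001)
print(len(hdr))  # 8
```

In a real deployment the encapsulation is done by the hypervisor's virtual switch (the VXLAN tunnel endpoint), not in application code; the sketch only illustrates where the extra ID bits live.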
Solution No. 3: VLANs and a virtual router
In the end, I simply broke down and created a duplicate set of VLANs on the physical networking gear and configured the secondary data center's virtual networking gear to talk to them. However, instead of creating a new VRF on the core switches to route the VLANs, I used a virtual appliance-based router. The difference here is that the router would not have to act as a VXLAN endpoint; it'd just have to understand tried-and-true 802.1q VLAN tagging -- which enjoys nearly ubiquitous support.
I configured the virtual router with a single virtual NIC, which is in turn configured as a VLAN trunk; VMware calls this Virtual Guest Tagging, or VGT, and it was exactly the capability missing from the VXLAN implementations I evaluated. That NIC can then carry multiple virtual interfaces within the virtual router (one for each VLAN), allowing it to act as the default gateway for an essentially arbitrary number of networks at the same time. Though the performance and redundancy of this kind of configuration might not be sufficient for some production environments, it's perfectly suited to this kind of test environment.
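With VGT, tagged frames reach the guest intact, and the router demultiplexes them by the 12-bit VLAN ID in the 802.1Q tag, handing each to the matching subinterface. A minimal sketch of that tag handling (the VLAN IDs and gateway addresses are illustrative):

```python
import struct

TPID_8021Q = 0x8100  # EtherType identifying an 802.1Q tag

def add_vlan_tag(frame: bytes, vlan_id: int) -> bytes:
    """Insert an 802.1Q tag after the destination and source MACs
    (first 12 bytes). The TCI here carries only the 12-bit VLAN ID,
    with priority (PCP) and DEI bits left at zero."""
    assert 0 < vlan_id < 4095
    tag = struct.pack("!HH", TPID_8021Q, vlan_id)
    return frame[:12] + tag + frame[12:]

def vlan_of(frame: bytes) -> int:
    """Read the VLAN ID a trunked (VGT) port would use to pick
    the right subinterface."""
    tpid, tci = struct.unpack_from("!HH", frame, 12)
    assert tpid == TPID_8021Q
    return tci & 0x0FFF

# One trunked NIC, one subinterface (default gateway) per VLAN:
subinterfaces = {101: "10.1.0.1/24", 102: "10.2.0.1/24"}
frame = add_vlan_tag(b"\x00" * 14, 101)  # minimal dummy Ethernet header
print(subinterfaces[vlan_of(frame)])  # 10.1.0.1/24
```

This is the classic "router on a stick" pattern: one physical (here, virtual) link, many L3 interfaces, with the tag doing the work that separate NICs would otherwise do.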
This compromise wasn't exactly what I wanted, but it was the best I could come up with and took me through the interesting process of learning some of the mechanics and current limitations of the available VXLAN implementations.