On its face, it looked like the L3 switches were claiming every IP on that subnet for themselves. Naturally, this could cause problems, and it was causing problems with ESXi. But remember -- Linux and Windows hosts on the same segment had the right MAC addresses in their ARP tables; only the ESXi boxes had the VRRP MAC address in their tables for other hosts on the segment.
This caused me to wonder why on earth the VRRP spec would include a provision for proxy ARP, which then immediately led me to realize that it doesn't -- but answering ARP requests on behalf of other hosts is exactly what proxy ARP is designed to do. A brief check of the bowels of the L3 switch showed that proxy ARP was enabled for that VLAN and only that VLAN. (Full disclosure: I'd enabled it for some tests a few weeks earlier since that lab segment is rarely used. In my advancing years, I'd simply forgotten I'd done it.) The reason that only ESXi exhibited this problem is that while Linux, Windows, and most other OSes place the first ARP is-at reply into their ARP tables, ESXi chooses the last response. Since the proxy ARP reply is artificially generated, it usually arrives a few milliseconds behind the actual ARP is-at response from the host itself, so it was the last one ESXi saw.
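To make the timing issue concrete, here is a minimal sketch of the two cache-population policies described above. It is a toy model, not actual ESXi, Linux, or Windows code; the IP address, the host MAC, the VRRP group ID, and the arrival times are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArpReply:
    t_ms: float  # arrival time, in ms after the who-has request (illustrative)
    ip: str      # IP address the reply claims
    mac: str     # MAC address the reply says that IP is at


def first_reply_wins(replies):
    """Keep the first is-at reply seen for each IP; ignore later ones."""
    cache = {}
    for r in sorted(replies, key=lambda r: r.t_ms):
        cache.setdefault(r.ip, r.mac)
    return cache


def last_reply_wins(replies):
    """Overwrite the entry on every reply, so the latest one sticks."""
    cache = {}
    for r in sorted(replies, key=lambda r: r.t_ms):
        cache[r.ip] = r.mac
    return cache


HOST_MAC = "00:50:56:aa:bb:cc"  # hypothetical MAC of the real host
VRRP_MAC = "00:00:5e:00:01:01"  # VRRP virtual MAC, assuming VRID 1

replies = [
    ArpReply(0.4, "192.0.2.20", HOST_MAC),  # genuine reply from the host
    ArpReply(2.1, "192.0.2.20", VRRP_MAC),  # proxy ARP reply from the switch
]

print(first_reply_wins(replies))  # the real host's MAC is cached
print(last_reply_wins(replies))   # the VRRP MAC is cached instead
```

With the proxy ARP reply trailing the genuine one, the first policy caches the host's own MAC and the second caches the switch's VRRP MAC, which matches what the ESXi boxes were showing.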
The moral of this story is that had I jumped on the packet traces right away, the fix would have been apparent much sooner. Another moral is that VMware should really change the ARP table population code in ESXi to conform to all other modern OSes and discard all but the first ARP reply. After all, keeping only the first reply does provide some form of protection against ARP cache poisoning, and the first response is usually the closest and most accurate.
If you're reading this and haven't spent much time digging into what's actually happening on the wire, maybe it's time to download Wireshark, take a snapshot of some traffic, and go through it to familiarize yourself with what goes on in the depths of an Ethernet network. The time you spend doing so will be repaid with interest down the line.
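If you'd like a head start on what to look for, the sketch below walks a list of raw Ethernet frames and flags any IP address that more than one MAC claims in ARP replies, which is the pattern that gave this problem away. It is only a rough illustration using the Python standard library: it assumes untagged Ethernet frames (no 802.1Q VLAN header), and the `build_arp_reply` helper, addresses, and MACs are invented for the demo rather than taken from a real capture.

```python
import socket
import struct
from collections import defaultdict

ETHERTYPE_ARP = 0x0806
ARP_REPLY = 2


def fmt_mac(raw: bytes) -> str:
    return ":".join(f"{b:02x}" for b in raw)


def parse_arp_reply(frame: bytes):
    """Return (sender_ip, sender_mac) for an IPv4 ARP reply, else None."""
    if len(frame) < 42:  # 14-byte Ethernet header + 28-byte ARP payload
        return None
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype != ETHERTYPE_ARP:
        return None
    htype, ptype, hlen, plen, oper, sha, spa = struct.unpack(
        "!HHBBH6s4s", frame[14:32]
    )
    if (htype, ptype, hlen, plen, oper) != (1, 0x0800, 6, 4, ARP_REPLY):
        return None
    return socket.inet_ntoa(spa), fmt_mac(sha)


def conflicting_claims(frames):
    """Map each IP to its claimed MACs, keeping only IPs with several."""
    claims = defaultdict(set)
    for frame in frames:
        parsed = parse_arp_reply(frame)
        if parsed:
            ip, mac = parsed
            claims[ip].add(mac)
    return {ip: macs for ip, macs in claims.items() if len(macs) > 1}


def build_arp_reply(sender_mac, sender_ip, target_mac, target_ip):
    """Demo-only helper: hand-build an ARP reply frame."""
    def mac(s):
        return bytes.fromhex(s.replace(":", ""))
    eth = mac(target_mac) + mac(sender_mac) + struct.pack("!H", ETHERTYPE_ARP)
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, ARP_REPLY)
    arp += mac(sender_mac) + socket.inet_aton(sender_ip)
    arp += mac(target_mac) + socket.inet_aton(target_ip)
    return eth + arp


ASKER = ("00:50:56:11:22:33", "192.0.2.10")  # hypothetical requesting host
frames = [
    build_arp_reply("00:50:56:aa:bb:cc", "192.0.2.20", *ASKER),  # real host
    build_arp_reply("00:00:5e:00:01:01", "192.0.2.20", *ASKER),  # proxy ARP
    build_arp_reply("00:50:56:dd:ee:ff", "192.0.2.30", *ASKER),  # no conflict
]
conflicts = conflicting_claims(frames)
print(conflicts)
```

In practice you would feed it frames pulled from a capture file rather than hand-built ones; a display filter for ARP in Wireshark will show you the same thing by eye.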
Basic networking has become so "easy" that most people view what actually happens on the wire as a dark art, though it holds the key to solving myriad networking problems. The few times that I've forgone pulling tcpdumps when troubleshooting network problems have been the times that the answer would have been right in front of my face.
This story, "The lost art of reading packet traces," was originally published at InfoWorld.com. Read more of Paul Venezia's The Deep End blog at InfoWorld.com.