Life in IT is full of onerous tasks. Along with making good backups and maintaining a solid patching regimen, you must ensure that multiple levels of antimalware software are properly deployed. Unfortunately, in heavily virtualized environments, antivirus can go beyond being a pain to manage and actually become a threat in and of itself. As the saying goes, sometimes the cure is worse than the disease.
That antivirus software can slow down a machine probably comes as no surprise to anyone. Any software that watches each and every disk I/O and inspects it for threats adds overhead that didn't previously exist. In most cases, this manifests itself through marginally higher disk latency and greater CPU load. But with careful use of scanning exclusions (for heavily used databases and the like), it's usually not enough to bring a system to its knees.
Recently, however, I've been presented with two excellent examples of how antivirus run amok can have enormous sitewide impact -- and how it can be difficult to detect the cause unless you know to look for it and have the monitoring data necessary to do so.
The new VDI environment
Inspection of the SAN itself showed no unusual events or failures. At first it seemed the problem might be isolated to the database cluster that had experienced it, but because the issue was no longer occurring, it was difficult to determine what had actually happened. Worse, historical performance logging wasn't available from the SAN infrastructure, so further investigation of the occurrences from the SAN's perspective wasn't possible.
Since the physical database cluster shared physical disk resources with a large virtualization infrastructure (which had extensive performance reporting capabilities), attention turned there to see if the virtualization environment also saw higher latency during the failure. Indeed it had -- with incredible latency spikes at precisely the same time.
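The kind of cross-check described here, lining up latency spikes from two systems that share storage, can be sketched in a few lines of Python. This is only an illustration: the sample values, the 50ms threshold, and the `spike_minutes` helper are all invented for the example and do not come from the incident's actual monitoring data.

```python
def spike_minutes(samples, threshold_ms):
    """Return the set of timestamps whose latency exceeds threshold_ms."""
    return {minute for minute, latency_ms in samples if latency_ms > threshold_ms}


# Hypothetical per-minute latency samples: (minute, latency in milliseconds).
db_cluster = [(0, 4), (1, 5), (2, 180), (3, 210), (4, 6)]
virt_hosts = [(0, 3), (1, 4), (2, 150), (3, 190), (4, 5)]

# Assumed threshold; a real one would depend on the storage's normal baseline.
THRESHOLD_MS = 50

# Minutes in which both systems saw a spike at the same time.
shared = sorted(
    spike_minutes(db_cluster, THRESHOLD_MS) & spike_minutes(virt_hosts, THRESHOLD_MS)
)
print("Minutes with spikes on both systems:", shared)
```

If the two sets of spikes overlap closely, shared storage becomes the prime suspect rather than either system on its own.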
That wasn't all. Not only had the virtualization environment seen the same latency, it was also generating approximately 200MBps of I/O per host across a cluster of eight hosts -- roughly 1,600MBps in aggregate, an I/O load well in excess of what the storage back end could handle gracefully. Further investigation showed that about half of the 200 production VMs had started directing massive amounts of I/O at the SAN at precisely the same time and had finished about 15 minutes later.
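To put those figures in perspective, here is a rough back-of-the-envelope calculation in Python. The inputs are the approximate numbers quoted above; the assumption that the rate held steady for the full 15 minutes is mine, made only to estimate total volume, and the units are decimal (1TB = 1,000,000MB).

```python
# Approximate figures from the incident described above.
MB_PER_SEC_PER_HOST = 200   # I/O generated per virtualization host
HOSTS = 8                   # hosts in the cluster
SCANNING_VMS = 100          # about half of the 200 production VMs
DURATION_MIN = 15           # how long the burst lasted


def aggregate_mb_per_sec(per_host, hosts):
    """Total I/O rate the cluster directed at the SAN."""
    return per_host * hosts


def per_vm_mb_per_sec(aggregate, vms):
    """Average I/O rate per scanning VM."""
    return aggregate / vms


def total_tb(aggregate, minutes):
    """Data moved, ASSUMING the rate was sustained for the whole window."""
    return aggregate * minutes * 60 / 1_000_000


aggregate = aggregate_mb_per_sec(MB_PER_SEC_PER_HOST, HOSTS)
print(f"Aggregate load: {aggregate} MBps")
print(f"Average per VM: {per_vm_mb_per_sec(aggregate, SCANNING_VMS):.0f} MBps")
print(f"Volume over {DURATION_MIN} min: {total_tb(aggregate, DURATION_MIN):.2f} TB")
```

An average of 16MBps per VM is modest in isolation, which is why the problem only becomes visible when the load is viewed in aggregate.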
In the end, it turned out that a configuration problem on the antivirus management server had caused a large number of the VMs to revert to an unmanaged state in which they would source their own antivirus definitions and, as in the VDI example above, perform an immediate system scan upon applying them. The load produced by roughly 100 virtual machines on high-performance virtualization hosts, all thrashing their disks as fast as they could, was more than enough to bring the SAN to its knees.