Antivirus: The silent virtualization killer

Traditional guest-based antivirus can sap the life out of your virtualization and storage infrastructure -- learn how to recognize the problem

Life in IT is full of onerous tasks. Along with making good backups and maintaining a solid patching regimen, you must ensure that multiple levels of antimalware software are properly deployed. Unfortunately, in heavily virtualized environments, antivirus can go beyond being a pain to manage and actually become a threat in and of itself. As the saying goes, sometimes the cure is worse than the disease.

That antivirus software can slow down a machine probably comes as no surprise to anyone. Any software that watches each and every disk I/O and inspects it for threats adds overhead that didn't previously exist. In most cases, this manifests itself through marginally higher disk latency and greater CPU load. But with careful use of scanning exclusions (for heavily used databases and the like), it's usually not enough to bring a system to its knees.

Recently, however, I've been presented with two excellent examples of how antivirus run amok can have enormous sitewide impact -- and how it can be difficult to detect the cause unless you know to look for it and have the monitoring data necessary to do so.

The new VDI environment

In the first instance, a client was in the process of bringing a new VDI environment into production. The base image had been fully tested, and the user base was excited to get rid of their ancient desktops and take advantage of the session portability that VDI would give them. Initial user testing had gone well, and no problems were detected.

However, as larger numbers of desktops were automatically deployed and the user count expanded, performance started to suffer. At first, things were a bit sluggish for everyone, but as the rollout proceeded, it became dramatically worse, to the point that users eventually started to miss those old desktops. Initial investigation on the virtualization hosts didn't show any significant CPU or memory contention, so attention quickly turned to the SAN.

A dig through the SAN's management interface immediately made it clear that the problem was indeed storage-related, with latencies peaking well above 20ms. As is often the case in such situations, the fear that the SAN wasn't up to the job of serving a VDI environment started to build.

Fortunately, the troubleshooting process didn't stop there. Further investigation into the SAN load revealed that an average of 40 IOPS was being generated by each and every VDI desktop for about an hour after booting -- far outside of the norm and much higher than what had been seen during initial image development and testing.
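
To see why a per-desktop figure that sounds modest can swamp shared storage, it helps to multiply it out. The sketch below is only a back-of-the-envelope estimate: the 40 IOPS figure is the one observed above, while the desktop counts and the function name aggregate_iops are hypothetical and chosen purely for illustration.

```python
# Rough estimate of the aggregate SAN load during a VDI boot storm.
# The 40 IOPS per desktop is the average observed in this case;
# the desktop counts below are hypothetical rollout stages.

BOOT_IOPS_PER_DESKTOP = 40  # sustained for roughly an hour after boot


def aggregate_iops(desktops: int, iops_per_desktop: float = BOOT_IOPS_PER_DESKTOP) -> float:
    """Return the total IOPS the shared storage must absorb at once."""
    if desktops < 0:
        raise ValueError("desktops must be non-negative")
    return desktops * iops_per_desktop


if __name__ == "__main__":
    for count in (25, 100, 500):  # hypothetical numbers of booted desktops
        print(f"{count:>4} desktops -> {aggregate_iops(count):,.0f} IOPS")
```

A small pilot of 25 desktops would add only 1,000 IOPS under this assumption, which is easy to miss in testing. At 500 desktops, the same behavior works out to 20,000 IOPS sustained for an hour, which is a very different proposition for a shared array.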

Eventually, it was determined that as the nonpersistent desktop images booted, their antivirus agent was sourcing new virus definitions from the management server, then performing a full system scan to ensure that no newly detectable risks had gotten through. This is a commonly used and perfectly reasonable approach in a physical desktop environment, but in a virtualization environment, it resulted in nothing short of the brutal murder of the underlying shared storage hardware.

As a test, deployments of new AV definitions were disabled (on the AV platform in use, it was impossible to disable the automatic scan). The result was a tenfold decrease in disk I/O during the early morning hours as new users logged in and new desktops were deployed -- effectively working around the problem, although it left the task of manually updating the signatures in its wake.

The crashed SQL cluster

In another instance, a client reported that a mission-critical application had become unresponsive. Initial troubleshooting showed that the highly redundant clustered database services had actually gone offline. After the database services had been restarted and the application restored, investigation of the SQL server logs showed that the database services had experienced a period of extremely high disk latency, which culminated in the service giving up and terminating.
