In IT we depend heavily on monitoring (or we should, if we're not already), so it's even more painful when that monitoring lets us down. In this case, there was no warning at all to contradict the UPS's healthy appearance. This is the risk we take in trusting all that monitoring, and it's a risk that can't realistically be eliminated.
A manual task would have prevented the problem
The only thing that could have possibly prevented this outage is an old adage that I use constantly: Fire it before it can quit.
There are several common IT elements where you can apply that saying: hard drives, batteries, and even some IT admins. In this case, the batteries may have been reporting full capacity, but they were in fact three years old and should have been on a replacement schedule. An APC tech examined the UPS after the load-drop event and determined that even though the batteries appeared fine, the unit's output dropped severely when a self-test was run with no load attached. Either the batteries were lying to the monitoring code in the UPS, or the monitoring code was lying to everyone else. Either way, the result was a day of chaos and lost work, time, and effort.
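The replacement schedule those batteries should have been on can be sketched as a simple age check. This is a minimal illustration, not a real inventory tool; the part names, dates, and service-life figures (apart from the three-year battery age from this story) are assumptions, not vendor recommendations.

```python
from datetime import date

# Assumed service-life limits, in years, for parts to replace proactively
# ("fire it before it can quit"). The three-year battery figure matches
# the batteries in this story; the rest are illustrative only.
SERVICE_LIFE_YEARS = {
    "ups_battery": 3,
    "hard_drive": 5,
}

def years_in_service(installed: date, today: date) -> float:
    """Approximate age of a part in years."""
    return (today - installed).days / 365.25

def overdue_parts(inventory, today):
    """Return parts whose age meets or exceeds their service life,
    regardless of whether monitoring still reports them healthy."""
    flagged = []
    for part, installed in inventory:
        limit = SERVICE_LIFE_YEARS.get(part)
        if limit is not None and years_in_service(installed, today) >= limit:
            flagged.append(part)
    return flagged

# Example: batteries installed three years ago are due; a newer drive is not.
inventory = [
    ("ups_battery", date(2009, 6, 1)),
    ("hard_drive", date(2011, 1, 15)),
]
print(overdue_parts(inventory, today=date(2012, 7, 1)))  # prints ['ups_battery']
```

The point of keying the check to install date rather than to a health metric is exactly the lesson here: the self-reported status can lie, but the calendar can't.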
So even if you're monitoring everything under the sun and keeping tabs on even the tiniest component in your infrastructure, take a moment to realize that sometimes the best thing you can do to reduce problems is to replace parts that might seem to be working fine.