Implementation. Ultimately, it doesn't matter how good the product is -- if it's not implemented properly, chances are it's going to break or at the very least perform poorly. Complexity makes incorrect implementation more likely. Fortunately, with proper testing, most implementation errors can be rooted out before systems go into production. Failure to perform adequate acceptance testing will leave the discovery of the worst of these problems until the systems are under load.
Maintenance and testing. In my experience, lack of appropriate maintenance and lack of testing are the two largest factors that contribute to downtime of all kinds. The reason should be obvious to anyone working in IT: We're all being asked to do more work with fewer resources.
I honestly can't recall the last time I saw an IT department where an employee didn't have enough work to do to justify his or her job. It's usually the opposite: The business is asking for new functionality faster than IT can deliver it, so regular maintenance and appropriate levels of testing fall by the wayside.
What you can do about it
If you do absolutely nothing else, test. Test everything as frequently as you can. Test backups, failover clusters, redundant switches, and SAN snapshots -- test anything that you've spent good money on to save your bacon if something breaks. Make sure to test under non-ideal circumstances -- don't check to see that everything is working properly before you test, because you won't have that luxury in a real failure. Don't shut things down cleanly; pull the plug. If you haven't tested something in the past three months, or since the last major architectural change, assume it just won't work like it's supposed to.
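The three-month rule of thumb is easy to turn into a checklist you can run against your own inventory. What follows is only a minimal sketch in Python, not a real tool: the names RecoverySystem, is_trusted, and untrusted are made up for illustration, "three months" is approximated as 90 days, and the inventory dates are invented sample data.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional

# Assumption: "three months" is approximated as 90 days.
MAX_TEST_AGE = timedelta(days=90)


@dataclass
class RecoverySystem:
    """Hypothetical inventory record for one backup or failover mechanism."""
    name: str
    last_tested: Optional[date]               # None means it has never been tested
    last_major_change: Optional[date] = None  # None means no major change on record


def is_trusted(system: RecoverySystem, today: date) -> bool:
    """Return True only if the system was tested recently and since its last major change."""
    if system.last_tested is None:
        return False
    if today - system.last_tested > MAX_TEST_AGE:
        return False
    # A change on or after the last test date invalidates the test. This is
    # deliberately conservative: dates alone can't say which came first that day.
    if (system.last_major_change is not None
            and system.last_major_change >= system.last_tested):
        return False
    return True


def untrusted(systems: List[RecoverySystem], today: date) -> List[str]:
    """Names of the systems that should be assumed broken until retested."""
    return [s.name for s in systems if not is_trusted(s, today)]


# Invented sample data, purely for illustration.
inventory = [
    RecoverySystem("backups", last_tested=date(2024, 5, 1)),
    RecoverySystem("failover cluster", last_tested=date(2024, 1, 15)),
    RecoverySystem("redundant switches", last_tested=date(2024, 5, 10),
                   last_major_change=date(2024, 5, 20)),
    RecoverySystem("SAN snapshots", last_tested=None),
]

needs_retest = untrusted(inventory, today=date(2024, 6, 1))
```

With the sample data above, needs_retest contains the failover cluster (tested too long ago), the redundant switches (changed since the last test), and the SAN snapshots (never tested). A list like that is also a concrete thing to put in front of management when you ask for testing time.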
If you ask management or business stakeholders for the necessary time to do the testing and are denied it for whatever reason, make it crystal clear that you can't guarantee failover systems will function properly in a failure scenario. This will not make you popular. But trust me, it's a heck of a lot better than being shown the door because the failover system you're responsible for didn't work.