Even a disaster can’t save this IT department

Budget cutbacks, overworked employees, and out-of-date systems lead to a middle-of-the-night emergency -- but little response from management

We in IT are used to toiling in the shadows until more visible departments run into problems. Then it's our duty to use our limited resources to clean up the mess and slink back to our corner until the next emergency. All the while, management has no idea it continues to make our jobs harder when implementing a few IT initiatives could help the entire organization.

I worked in a business climate where change was hard to come by. The prevailing circumstances:

  • Operations were the top priority at the company.
  • IT received little budget to support operations.
  • IT was not given a maintenance window to take care of tasks (operations were considered more important).
  • The IT department had a high turnover rate.

Many of us know the difference between designs meant to be minimally functional and those that are far more robust. Our business (operations) fit the first category and ran basically 24/7, even though official business hours were 8/5.

A problematic environment

Though operations covered multiple shifts, IT had only enough personnel for one shift -- or in our case, 1.5 shifts with some people starting earlier and others starting later. There was not enough knowledge overlap to have totally independent shifts. Even with the existing staff, we barely had enough time to support our user base, never mind tackling new projects -- we needed at least two more people on the IT team.

As operations ran pretty much 24/7, they demanded 24/7 uptime -- and support when systems went down. To support a business running 24/7, one needs robust infrastructure, which we didn’t have due to budget constraints. There also wasn’t enough budget allocated to build the redundancy needed for better performance, reliability, or scalability.

Even when we had a budget to shore up certain parts of the infrastructure, there was hardly a time when IT could put in new systems or perform basic maintenance.

As a result of these factors, systems were not as up-to-date as they should have been (which in itself can cause problems). There was little, if any, redundancy of systems, so we had many single points of failure.

Our systems were also not as reliable as they could be, because the pressure for cost-cutting brought in what I considered “SMB” systems versus enterprise systems. We’d been trying to keep it all running given the constraints, but as they say, when it rains, it pours.

A day of reckoning ... that wasn't

One time, we had a long blackout, which drained our UPS units. We also didn’t have a working generator to fall back on, because senior management had not allocated money for a replacement even though I’d been asking for the funds for several months.

When the power came back up, we found out that almost nothing was accessible on the network. After frantic troubleshooting (at 3 a.m.), we found out that whoever had set up the core Cisco switch had forgotten to save the running configuration. Due to the high IT turnover rate, nobody had kept documentation (I’m guessing they were too busy), and there was no backup configuration we could quickly restore.

Suffice it to say, it was a very long night/day of tracing a cabling mess, reconfiguring all the VLANs, and getting all the systems back up and running. Of course, I made sure to save the new configuration -- and a copy of it elsewhere.
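For readers who want to catch this kind of problem before a blackout does, here is a minimal sketch of a check that compares a switch's running configuration against its saved startup configuration. It is an illustration under assumptions, not what we had in place at the time: the function name unsaved_changes is hypothetical, and the script assumes you have already pulled both configurations off the device as plain text by whatever means you normally use. On Cisco IOS, the command that saves the running configuration is copy running-config startup-config.

```python
import difflib
from typing import List


def unsaved_changes(running: str, startup: str) -> List[str]:
    """Return unified-diff lines showing what is in the running
    configuration but not in the saved startup configuration.

    Hypothetical helper: both arguments are assumed to be the plain-text
    configs, already retrieved from the switch by some other tool.
    """

    def normalize(text: str) -> List[str]:
        lines = []
        for line in text.splitlines():
            stripped = line.rstrip()
            # Skip blank lines and "!" comment/separator lines, which
            # carry no settings. Leading indentation is kept because it
            # marks sub-commands (for example, a VLAN's name).
            if not stripped or stripped.startswith("!"):
                continue
            lines.append(stripped)
        return lines

    diff = difflib.unified_diff(
        normalize(startup),
        normalize(running),
        fromfile="startup-config",
        tofile="running-config",
        lineterm="",
    )
    return list(diff)


if __name__ == "__main__":
    running = "hostname core\n!\nvlan 10\n name users\n"
    startup = "hostname core\n!\n"
    for line in unsaved_changes(running, startup):
        print(line)
```

An empty result means the two configurations match. Any line beginning with "+" is a setting that would be lost on the next power cycle, which is a cue to save the configuration and store a copy of it somewhere off the switch.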

In spite of such circumstances, it seems like senior management rarely changes its ways, and this was no exception. In IT we can see the problems coming a mile away, but with no changes or support, it feels as if we hit the same wall again and again.