The inside scoop on how HealthCare.gov is getting fixed

Rackspace CTO John Engates tells InfoWorld about his recent White House briefing on the improvements to HealthCare.gov

Yesterday, HealthCare.gov relaunched with a new user interface and a bunch of back-end improvements. It remains to be seen how the site stands up under a full load -- and questions linger about whether applications are being properly delivered to insurers. But a recent conversation with John Engates, CTO of Rackspace, convinced me that at least things are headed in the right direction.

Along with half a dozen other tech leaders, Engates was invited to the White House Situation Room for a briefing on Nov. 25 with Chief of Staff Denis McDonough. Engates also met with the new point man, Jeffrey Zients, and toured the operations center at QSSI, the government contractor charged with pulling together other contractors and agency personnel for a project that had previously foundered without any real central management.


To be clear, Engates is not part of the "tech surge" that has been helping to fix HealthCare.gov. Engates believes he and the other invitees were brought in because "they were looking for some validation that they were doing the right things. They also wanted to open a dialogue on what needs to change in the way that the government procures IT...and maybe open it up to a broader set of contractors in the future that aren't the usual suspects."

Sounds like a worthy goal. Many of the improvements Engates outlines amount to basic best practices that should have been in place from the start. This fits a familiar pattern in many big federal IT projects, where perverse incentives upend common sense and eliminate from contention the best candidates for the job.

Getting under one roof

According to Engates, just the fact that everyone is under one roof at QSSI's operations center is a huge improvement. That crew includes point people from CGI Federal, the contractor that has taken the brunt of the blame for the bungled site, and Verizon Terremark, HealthCare.gov's hosting provider. In addition, there are people from MarkLogic, the database software vendor; the operating system vendor; the monitoring services provider; and so on.

"People are literally sitting in the same room at desks with multiple computers," says Engates. "They have a morning meeting, an afternoon meeting, standup meetings. When you put people in one room and you have individual accountability with a face and a name attached to it, it's a lot harder to point fingers."

Inviting outside perspectives

Instead of just "yelling louder at the contractors," says Engates, they've brought in outside specialists. "I don't want to name names," he says, "but some very well-respected companies that we all know have people that are sort of on loan or maybe on leave from their companies. People whom I respect in terms of their ability to run big-scale websites." As reported by the New York Times yesterday, that roster includes Michael Dickerson, a site reliability engineer at Google.

Because these outside people are participating directly and are "literally there every day" working "16- or 18-hour days," says Engates, they've collectively raised expectations and intensity to a level higher than you'd see in a typical government contract job.

More automation, real testing

Orchestration and automation tools that much of IT now relies upon were previously absent from HealthCare.gov, says Engates, which meant that admins were logging into individual servers to make changes -- and increasing the risk of human error. Today that automation has been put in place. "This seems obvious to a company like Rackspace that has to run things like a devops shop, but a lot of enterprise shops don't necessarily do that yet," he says.

Also, as elementary as it seems, testing is now being done using a staging environment. Engates says, "All this stuff is getting tested before it goes into production, and I'm not sure that was happening consistently before." According to several accounts, neither proper testing nor automation was put in place before launch simply because the parties involved ran out of time.

Improved monitoring

"Up until the surge, when they brought in those third-party experts, I don't think they had good visibility into what was going wrong and where," says Engates. In other words, individual contractors had monitoring systems for their own bits of the project, but no big picture was available. "I don't think there was comprehensive way to look at the logs across different systems and find root causes," he adds.
