The inside scoop on how HealthCare.gov is getting fixed

Rackspace CTO John Engates tells InfoWorld about his recent White House briefing on the improvements to HealthCare.gov

Yesterday, HealthCare.gov relaunched with a new user interface and a bunch of back-end improvements. It remains to be seen how HealthCare.gov stands up under a full load -- and questions linger about whether applications are being properly delivered to insurers. But a recent conversation with John Engates, CTO of Rackspace, convinced me that at least things are headed in the right direction.

Along with half a dozen other tech leaders, Engates was invited to the White House Situation Room for a briefing on Nov. 25 with Chief of Staff Denis McDonough. Engates also met with the new point man, Jeffrey Zients, and toured the operations center at QSSI, the government contractor charged with pulling together other contractors and agency personnel for a project that had previously foundered without any real central management.

To be clear, Engates is not part of the "tech surge" that has been helping to fix HealthCare.gov. Engates believes he and the other invitees were brought in because "they were looking for some validation that they were doing the right things. They also wanted to open a dialogue on what needs to change in the way that the government procures IT ... and maybe open it up to a broader set of contractors in the future that aren't the usual suspects."

Sounds like a worthy goal. Many of the improvements Engates outlines amount to basic best practices that should have been in place from the start. This fits a familiar pattern in many big federal IT projects, where perverse incentives upend common sense and eliminate from contention the best candidates for the job.

Getting under one roof

According to Engates, just the fact that everyone is under one roof at QSSI's operations center is a huge improvement. That crew includes point people from CGI Federal, the contractor that has taken the brunt of the blame for the bungled site, and Verizon Terremark, HealthCare.gov's hosting provider. In addition, there are people from MarkLogic, the database software vendor; the operating system vendor; the monitoring services provider; and so on.

"People are literally sitting in the same room at desks with multiple computers," says Engates. "They have a morning meeting, an afternoon meeting, standup meetings. When you put people in one room and you have individual accountability with a face and a name attached to it, it's a lot harder to point fingers."

Inviting outside perspectives

Instead of just "yelling louder at the contractors," says Engates, they've brought in outside specialists. "I don't want to name names," he says, "but some very well-respected companies that we all know have people that are sort of on loan or maybe on leave from their companies. People whom I respect in terms of their ability to run big-scale websites." As reported by the New York Times yesterday, that roster includes Michael Dickerson, a site reliability engineer at Google.

Because these outside people are participating directly and are "literally there every day" working "16- or 18-hour days," says Engates, they've collectively raised expectations and intensity to a level higher than you'd see in a typical government contract job.

More automation, real testing

Orchestration and automation tools that much of IT now relies upon were previously absent from HealthCare.gov, says Engates, which meant that admins were logging into individual servers to make changes -- and increasing the risk of human error. Today that automation has been put in place. "This seems obvious to a company like Rackspace that has to run things like a devops shop, but a lot of enterprise shops don't necessarily do that yet," he says.

Also, as elementary as it seems, testing is now being done using a staging environment. Engates says, "All this stuff is getting tested before it goes into production, and I'm not sure that was happening consistently before." According to several accounts, neither proper testing nor automation was put in place before launch simply because the parties involved ran out of time.

Improved monitoring

"Up until the surge, when they brought in those third-party experts, I don't think they had good visibility into what was going wrong and where," says Engates. In other words, individual contractors had monitoring systems for their own bits of the project, but no big picture was available. "I don't think there was a comprehensive way to look at the logs across different systems and find root causes," he adds.

In some cases, Engates observes, even the contractors lacked visibility into the infrastructure layer because that was under the control of the hosting provider, Verizon Terremark. "What I think they've got now is a monitoring system that looks down into the application layer and rolls up some of the performance information with the hardware. They have a much better picture with this system," he says.
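For readers who want to see what "looking at the logs across different systems" can mean in practice, here is a minimal Python sketch. It is my own illustration, not a description of the tooling actually used on the site: the record fields (request_id, ts, component, level), the sample log entries, and the function names are all assumptions. The idea is simply to merge each component's log into one timeline per request, so the earliest failure stands out.

```python
from collections import defaultdict


def merge_logs(*component_logs):
    """Combine per-component log records into one time-ordered list per request ID."""
    timelines = defaultdict(list)
    for records in component_logs:
        for rec in records:
            timelines[rec["request_id"]].append(rec)
    for recs in timelines.values():
        recs.sort(key=lambda r: r["ts"])
    return dict(timelines)


def first_error(timeline):
    """Return the earliest ERROR record in a timeline, or None if there is none."""
    for rec in timeline:
        if rec["level"] == "ERROR":
            return rec
    return None


# Hypothetical sample data: three layers, each with its own separate log.
web_log = [
    {"request_id": "r1", "ts": 1.0, "component": "web", "level": "INFO", "msg": "GET /plans"},
    {"request_id": "r1", "ts": 1.7, "component": "web", "level": "ERROR", "msg": "500 returned"},
]
app_log = [
    {"request_id": "r1", "ts": 1.2, "component": "app", "level": "INFO", "msg": "lookup started"},
    {"request_id": "r1", "ts": 1.6, "component": "app", "level": "ERROR", "msg": "lookup failed"},
]
db_log = [
    {"request_id": "r1", "ts": 1.5, "component": "db", "level": "ERROR", "msg": "query timeout"},
]

if __name__ == "__main__":
    timelines = merge_logs(web_log, app_log, db_log)
    cause = first_error(timelines["r1"])
    print(cause["component"], "-", cause["msg"])
```

Viewed separately, each of the three logs shows an error of its own. Only the merged timeline makes it plain that the database timeout came first and the other two errors followed from it.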

Adding horsepower

Now that HealthCare.gov has been running for a while, says Engates, hosting provider Verizon Terremark has been able to throw more powerful hardware at bottlenecks that normally would have surfaced in load testing cycles prior to launch. Engates adds that because the site is for domestic rather than global use, Terremark has the luxury of being able to bring portions of the system down in the middle of the night for upgrades and maintenance.

Lightening the load

Most people have heard about the fatal decision to register people before they could see their plan options. As Engates observes, this turned the conventional e-commerce funnel on its head. Instead of going from general browsing to entering the specific info necessary for a transaction, applicants entered that information first, which incurred a string of activities and dependencies that ultimately brought the system to its knees.

A host of other databases needed to be pinged, including those maintained by the Social Security Administration, Health and Human Services, Homeland Security, and the IRS for such tasks as determining citizenship or whether an applicant qualified for supplemental assistance. As Engates explains, "Before, when you compared plans, not only did it have to look up who you were, then it started to look up the plan information from all these data sources. Every time somebody would load a particular page, it was doing multiple queries against multiple databases. Just optimizing the way the system works, presenting data that's pre-computed instead of computed on the fly, I think that can have a tremendous benefit -- and I think they've done quite a bit of that."
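The difference between "pre-computed" and "computed on the fly" is easy to show in a few lines of Python. What follows is a rough sketch under my own assumptions, not the site's actual design: the two plan sources, the plan records, and every class and function name are hypothetical. A counter on the stand-in back end records how many queries each approach costs.

```python
# Hypothetical plan data standing in for separate back-end data sources.
PLAN_SOURCES = {
    "source_a": [
        {"plan": "A-Bronze", "region": "TX", "premium": 210},
        {"plan": "A-Silver", "region": "CA", "premium": 320},
    ],
    "source_b": [
        {"plan": "B-Bronze", "region": "TX", "premium": 195},
        {"plan": "B-Gold", "region": "CA", "premium": 410},
    ],
}


class CountingBackend:
    """Wraps the data sources and counts every query made against them."""

    def __init__(self, sources):
        self.sources = sources
        self.queries = 0

    def query(self, source_name, region):
        self.queries += 1
        return [p for p in self.sources[source_name] if p["region"] == region]


def plans_on_the_fly(backend, region):
    """Query every source on each call, as if on every page load."""
    plans = []
    for source_name in backend.sources:
        plans.extend(backend.query(source_name, region))
    return sorted(plans, key=lambda p: p["premium"])


def precompute_plan_index(backend, regions):
    """Run the queries once per region ahead of time and keep the results."""
    return {region: plans_on_the_fly(backend, region) for region in regions}


if __name__ == "__main__":
    slow = CountingBackend(PLAN_SOURCES)
    for _ in range(100):  # 100 simulated page loads
        plans_on_the_fly(slow, "TX")
    print("on the fly:", slow.queries, "queries")

    fast = CountingBackend(PLAN_SOURCES)
    index = precompute_plan_index(fast, ["TX", "CA"])
    for _ in range(100):  # the same 100 page loads, served from the index
        index["TX"]
    print("precomputed:", fast.queries, "queries")
```

With this toy data, 100 page loads cost 200 back-end queries when computed on the fly and 4 when served from the precomputed index. The trade-off is that the index has to be rebuilt whenever the underlying plan data changes.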

Moreover, the option to "see plans before I apply" is now prominently displayed on the home page. And of course, third parties can provide Affordable Care Act health plan comparisons as well. Three young San Francisco developers made a name for themselves a few weeks ago by building a site in a matter of days called HealthSherpa, which enables applicants to find which policies are available in their area, whether they may be entitled to subsidies based on income, and how to obtain coverage that meets their needs.

Fixing the basic problem

The story is sadly familiar. Back in 2005 I wrote a story about the FBI's Trilogy project, whose $170 million Virtual Case File system was never implemented at all. In that instance, you could blame ridiculous requirements bloat as the main culprit. But in a fundamental sense the same dysfunctional dynamic that whacked the Trilogy project also hammered HealthCare.gov.

As I learned back then, government contractors have a huge incentive to agree to unrealistic requirements and deadlines thanks to "cost plus" contracts, which were given both to CGI Federal for HealthCare.gov and to SAIC for the Trilogy project. These types of contracts estimate the real cost of a project and add a profit margin that is awarded annually to the contractor -- in full, in part, or not at all, depending on the government's rating of the contractor's performance for that year.

For the Trilogy article, Gartner Fellow John Pescatore explained the effects of cost-plus contracts in colorful terms: "Here's what happens. In the beginning, you never want to say no, because you'll get a bad rating. It essentially incents the contractor to be much more accepting of out-of-scope changes. It's kind of like a mass-suicide pact, except you're hoping a miracle is going to occur later on."

It's stunning to me that cost-plus contracts, which have been implicated in outrageous Defense Department waste for decades, are allowed to persist across the federal government. Despite failed projects, many of the same contractors pop up again and again. As Engates notes, winning a government contract appears to depend more on the contractor's skill at vaulting bureaucratic and political hurdles than on the ability to build, deploy, and maintain viable systems.

That's how it appears to have gone down with HealthCare.gov. Engates puts it this way: "There are a lot of guys that could have probably done a great job on this. But they aren't necessarily competing on these contracts. And even if they were, they don't have all the right characteristics to win one. In the way these contracts are awarded today, it's very hard to get the right folks on the job."

This article, "The inside scoop on how HealthCare.gov is getting fixed," originally appeared at InfoWorld.com. Read more of Eric Knorr's Modernizing IT blog. And for the latest business technology news, follow InfoWorld on Twitter.

Copyright © 2013 IDG Communications, Inc.