It's been a difficult two weeks for Rackspace and its users, with two power outages in a co-location facility interrupting service for an estimated 2,000 customers.
Rackspace, which prides itself on “fanatical support,” has been open about its failures, communicating with customers directly and through the company's official blog and Twitter account. Open communication and a commitment to fixing technical problems will both be crucial for Rackspace as it attempts to repair damaged credibility, says CEO Lanham Napier.
“Any time we have an incident like this, it does impact our credibility,” Napier said in an interview Friday with Network World. “The only way we earn it back is we have to execute at a high level for a long time.”
Power outages on June 29 and July 7 hit Rackspace's 144,000-square-foot data center in the Dallas suburb of Grapevine. Rackspace operates nine data centers worldwide for about 60,000 customers. Within the Dallas facility, some customers experienced about 40 minutes of downtime on June 29, and some suffered 15 to 20 minutes of downtime on July 7.
The facility has three “phases,” or physical areas, and both outages hit the same phase, affecting a total of about 2,000 customers, according to Rackspace. Judging by comments on a recent Network World article, reactions range from anger at Rackspace for not eliminating every point of failure to acceptance that downtime can never be completely prevented and that Rackspace did well in quickly solving the problems and communicating with customers.
“I’m sure there will be some [customers] who are upset with us,” Napier said. “Let’s face it. We let them down. It wouldn't surprise me if some customers leave. I hope most of them stay with us.”
Rackspace has said it will issue between $2.5 million and $3.5 million in service credits to customers. Depending on the service a customer has paid for, service-level agreements can range from 99.9% uptime to 100%, Napier said.
On June 29, Rackspace suffered a utility power interruption, and was forced to move equipment over to generator power. The generators initially held the load and then failed, resulting in 40 minutes of downtime, Napier said.
An incident review cited failure of generators to synchronize with UPS systems, and failure of switches in the electrical infrastructure, preventing transfer of electrical load between different power sources. By July 3, the Rackspace blog reported that maintenance to the generator had “eliminated the excitation failures that caused recent customer disruptions.”
Trouble struck again on July 7 with the failure of a bus duct, a 10-foot, 300-pound piece of copper that distributes electricity. This prevented proper operation of a UPS system, taking customer servers down for about 20 minutes before Rackspace could connect them to generator power. The generators worked this time and carried the load for hours while workers replaced the bus duct, Napier said. Rackspace is still investigating the root cause of the bus duct failure, he said.
Whether an individual customer suffered downtime was in some cases determined by the level of service that customer had paid for. For example, some customers pay for a higher level of service that lets them draw power from different phases of the facility, and they were able to avoid downtime, Napier said.
