The problem was that recent changes designed to improve traffic flow on request routers (servers that direct Web queries to the appropriate Gmail server) overloaded the system after workers took some Gmail servers offline for routine upgrades.
"As we now know, we had slightly underestimated the load which some recent changes placed on the request routers," Ben Treynor, site reliability czar, wrote on the Gmail blog. "At about 12:30 p.m. Pacific a few of the request routers became overloaded and in effect told the rest of the system 'stop sending us traffic, we're too slow!' This transferred the load onto the remaining request routers, causing a few more of them also to become overloaded, and within minutes nearly all of the request routers were overloaded."
The overload resulted in people around the world being unable to access Gmail for about 100 minutes, Treynor said, though he noted that IMAP/POP access and mail processing continued to work normally.
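The cascade Treynor describes can be illustrated with a toy model. The sketch below is not Google's system: the function name, the router capacities and the load figures are all hypothetical, and it assumes traffic is split evenly among whichever routers are still accepting it.

```python
def simulate_cascade(capacities, total_load):
    """Return how many routers still accept traffic after each round.

    Toy model (hypothetical, not Gmail's real behavior): load is split
    evenly across healthy routers, and any router whose share exceeds
    its capacity stops accepting traffic entirely.
    """
    healthy = list(capacities)
    history = [len(healthy)]
    while healthy:
        share = total_load / len(healthy)
        # Routers that cannot handle their share drop out of the pool.
        survivors = [c for c in healthy if c >= share]
        if len(survivors) == len(healthy):
            break  # every remaining router can cope, so the pool is stable
        healthy = survivors
        history.append(len(healthy))
    return history


# Ten hypothetical routers with capacities 100, 105, ..., 145.
capacities = list(range(100, 150, 5))
print(simulate_cascade(capacities, 1000))  # load the pool can absorb
print(simulate_cascade(capacities, 1100))  # slightly underestimated load
```

In this model a load of 1,000 leaves all ten routers running, while a load only 10 percent higher knocks out the two weakest routers first and then, as their traffic shifts to the rest, the entire pool.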
Gmail engineers were alerted to the problem within seconds of the failures and, after diagnosing it, brought additional request routers online. Gmail is now more than 99.9 percent available to users, he said.
"We've turned our full attention to helping ensure this kind of event doesn't happen again," he wrote.
One fix the company plans to make is to have request routers slow down when overloaded instead of refusing to accept traffic. Treynor also said the request routers need sufficient failure isolation so that a problem in one data center doesn't affect servers in another.
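The difference between refusing traffic and slowing down can be sketched with a single round of a toy model. This is an illustration only, not the company's planned implementation: the function name, the policy labels and the numbers are hypothetical, and load is assumed to be split evenly across routers.

```python
def serve_round(capacities, total_load, policy):
    """Return (served, unserved) load for one round under a given policy.

    Hypothetical policies:
      "refuse"    - an overloaded router rejects its whole share.
      "slow_down" - an overloaded router serves up to its capacity and
                    delays only the excess.
    """
    if policy not in ("refuse", "slow_down"):
        raise ValueError(f"unknown policy: {policy}")
    share = total_load / len(capacities)
    served = 0.0
    for capacity in capacities:
        if capacity >= share:
            served += share          # router handles its full share
        elif policy == "slow_down":
            served += capacity       # router degrades but keeps working
        # under "refuse" an overloaded router serves nothing this round
    return served, total_load - served


capacities = list(range(100, 150, 5))  # ten hypothetical routers
print(serve_round(capacities, 1100, "refuse"))
print(serve_round(capacities, 1100, "slow_down"))
```

Under the "refuse" policy the two overloaded routers shed their entire share, which must then be absorbed by the others; under "slow_down" only the small excess above their capacity is delayed, so the load on the remaining routers does not grow.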
The company will work over the next few weeks to make these changes and further improve reliability, he said.