Once you move your core IT systems into private or public cloud networks, your work isn't over. Now you have a different set of technology issues to deal with: managing the cloud to ensure that your investments pay off for your enterprise and deliver the efficiencies and ROI that you're expecting.
Cloud management and monitoring have become even more important in the wake of April's Amazon Elastic Compute Cloud (EC2) outage, when the IT world got to see just what happens when a cloud environment runs into problems, taking the operations of many companies down with it. There have been other recent serious cloud outages as well.
Getting the performance that your enterprise is paying for is "one of the big 'gotchas' for public clouds," says Mary Johnston Turner, an analyst at IDC. In a recent study of 250 user companies, service-level agreement (SLA) performance guarantees ranked second in importance after the specific needs of the applications themselves, she says.
"Enterprises are very concerned about performance," she says. "One of the reasons you're seeing so much interest in private clouds is because IT leaders are responsible for getting good performance to their users" and they aren't always ready to hand those huge responsibilities over to third-party cloud vendors.
And that, she adds, is not just a cloud problem; it stems from the complexity of composite applications that are then introduced into cloud environments.
"It's a huge challenge," Turner says. "Users need to be investing in application performance management [products] that are built for composite applications and virtualized environments. There's a whole category now."
The idea, she says, is to independently monitor the performance of applications as they cross the network or the cloud, and then measure that performance where it reaches the end user, whether inside or outside the firewall.
For David Ting, vice president of engineering at IGN.com, one of the largest video game review websites in the world, monitoring his company's cloud performance is critical because the business lives or dies based on the ability of its 25.4 million users to connect with the site's ad-supported online properties.
"For us, performance is money because page views are key," he says. "We're ad-supported, so every page view counts" and helps the company bring in revenue. "These are things that we watch very carefully."
To make it all work, IGN Entertainment -- a division of media giant News Corp. -- uses performance monitoring tools from San Francisco-based New Relic that allow IGN to continuously watch over the performance of its sites in the cloud. "We depend very heavily on that tool," Ting says. "For us it's about response time and transactions per second for our IGN websites."
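The two figures Ting names, response time and transactions per second, can be derived from a plain request log. The following is a minimal Python sketch of that arithmetic, not New Relic's method; the function name `summarize`, the log format of (timestamp, duration) pairs and the sample numbers are all assumptions made for illustration.

```python
import math
from statistics import mean


def summarize(requests):
    """Summarize a request log given as (timestamp_seconds, duration_ms) pairs.

    Hypothetical helper: returns transactions per second, the average
    response time and the 95th-percentile response time (nearest rank).
    """
    if not requests:
        return {"tps": 0.0, "avg_ms": 0.0, "p95_ms": 0.0}

    timestamps = [t for t, _ in requests]
    durations = sorted(d for _, d in requests)

    # Length of the observed window; fall back to 1 second if every
    # request carries the same timestamp, to avoid dividing by zero.
    window = (max(timestamps) - min(timestamps)) or 1.0

    # Nearest-rank percentile: the smallest value with at least 95%
    # of the samples at or below it.
    rank = max(1, math.ceil(0.95 * len(durations)))

    return {
        "tps": len(requests) / window,
        "avg_ms": mean(durations),
        "p95_ms": durations[rank - 1],
    }


# Made-up sample: five requests over four seconds, one of them slow.
sample = [(0.0, 100), (1.0, 120), (2.0, 110), (3.0, 900), (4.0, 130)]
print(summarize(sample))
```

In this sample a single slow request pulls the average well above the typical response time, which is one reason a percentile is often tracked alongside the mean.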
Tracking performance as cloud deployment expands
IGN.com has been using the New Relic tools for about 18 months. It started out by moving non-production development and other applications to the cloud to see how things worked. Now IGN.com is putting some new projects onto cloud servers, including a social media stack, so the company can ramp up applications and scale them as needed, Ting says. The network's disaster recovery infrastructure is also slated to move to the cloud.
"It could eventually all go to the cloud," Ting says of the company's IT systems. "Performance stability would have to be more certain in the future for us to do that, but we're watching that."
The monitoring from New Relic provides performance metrics IGN couldn't get when it was using other tools, he says. The old tools "did OK for physical machine monitoring, but didn't do application stack monitoring at all without a lot of work from the engineering team."
By watching the New Relic management tools, IT workers can spin up more cloud-based servers, bring down poorly performing instances of applications, then add new instances as needed to keep up response times for users, he says. With the previous tools, Ting's team would obtain insights only into uptime, not response time.
"New Relic gave us tremendous visibility into the response time," which allows IT staffers to take actions on servers even when the servers are running, Ting explains. For example, "we have found instances where one memcached server performed much worse than others in the pool. Upon further investigation, we found one of the memory modules to be defective. In the Nagios world, that server would be running in the pool until it dies."
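The kind of check Ting describes, spotting one server that responds much more slowly than the rest of its pool, can be sketched as a comparison against the pool's median. This Python example is an illustration only, not how New Relic or IGN does it; the function name `find_slow_servers`, the 2x threshold, the server names and the latency samples are all assumptions.

```python
from statistics import median


def find_slow_servers(latencies_ms, factor=2.0):
    """Return the servers whose median latency exceeds factor x the pool median.

    latencies_ms maps a server name to a list of latency samples in
    milliseconds. Both the data shape and the default factor are
    assumptions chosen for this sketch.
    """
    # Median latency per server, skipping servers with no samples.
    per_server = {
        name: median(samples)
        for name, samples in latencies_ms.items()
        if samples
    }
    if not per_server:
        return []

    # The median of the per-server medians is barely moved by a single
    # bad server, so it serves as a stable baseline for the pool.
    pool_median = median(per_server.values())

    return sorted(
        name for name, value in per_server.items()
        if value > factor * pool_median
    )


# Made-up samples for a four-server cache pool.
pool = {
    "cache-1": [1.1, 1.3, 1.2],
    "cache-2": [1.0, 1.4, 1.2],
    "cache-3": [4.8, 5.5, 5.1],
    "cache-4": [1.2, 1.1, 1.3],
}
print(find_slow_servers(pool))  # prints ['cache-3']
```

An uptime-only check would report all four servers as healthy here, since each one is still answering requests.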
IGN.com is using Amazon's EC2 for its forays into the cloud today, Ting says.
With New Relic, IGN.com can watch over all the parts of its three-tiered architecture, from its front end to its databases to its API tier. The management tools help ensure that user response times stay optimal and don't spike.
"We can look at what's running on the cloud" using plug-ins that collect data and send all the analytics back to the New Relic tools, Ting says. "They give you very detailed reporting on how the server group is performing," he adds.
"The amount of data and the precision of the data is tremendous," Ting says. "This is where we can start looking at the metrics and be able to make intelligent business decisions with it."
In addition to moving its IT infrastructure, IGN.com has been exploring the cloud to host many of its more than 100 websites for increased performance and uptime, Ting says. The main sites include IGN.com, Askmen.com, Gamespy.com, Fileplanet.com, Teamxbox.com and Gamestats.com.
So far, the trials have been looking positive, Ting says. "We've got some infrastructure pieces moving out into the cloud," he notes. "It's in the experimental stage right now, and we're checking performance."
Using a variety of tools
Bleacher Report, an online publisher of fan newsletters about professional and college sports, also quickly found out about the importance of performance monitoring after it moved its core infrastructure to the cloud a year ago.
Sam Parnell, vice president of technology for the San Francisco-based business, says his firm was concerned about potential performance problems -- including possible latency issues -- as it worked to scale up for its 20 million unique users and 500 million page views per month. To prevent bottlenecks, he brought in a host of tools to monitor and manage the new cloud environment for the ad-supported sites.
"There is no one tool that does everything for us," Parnell says. "We use a variety of tools at different levels that give us a comprehensive monitoring suite. So far there haven't been latency issues, but we have used them to optimize various parts of the system."
The company's toolbox includes Scout, a server-level tool that allows IT staffers to see what loads look like on master and slave databases, as well as CPU utilization and memory consumption on servers, he says. The monitoring is done using agents that run on the cloud servers and report back with alerts and status data.
Also used are monitoring tools from Nagios Enterprises and open-source tools from Monit.
"There's certainly a good amount of overlap to these tools, but they all have things that each does well, which is why we use them together," Parnell says.
Bleacher Report also uses pinging tools from Pingdom to ensure that its various sites are up and running and performing well in the cloud.
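An external uptime check of this kind boils down to requesting a page on a schedule and recording whether it answered and how long it took. The Python sketch below uses only the standard library and is not Pingdom's implementation; the function names `probe` and `availability`, the timeout and the sample results are assumptions.

```python
import time
import urllib.request


def probe(url, fetch=urllib.request.urlopen, timeout=5.0):
    """Request url once and return (is_up, elapsed_ms).

    fetch is injectable so the check can be exercised without a network;
    by default it is the standard library's urlopen.
    """
    start = time.monotonic()
    try:
        with fetch(url, timeout=timeout) as response:
            is_up = 200 <= response.status < 400
    except OSError:
        # URLError, HTTPError and socket timeouts all derive from OSError.
        is_up = False
    elapsed_ms = (time.monotonic() - start) * 1000
    return is_up, elapsed_ms


def availability(results):
    """Percentage of probe results, as (is_up, elapsed_ms) pairs, that were up."""
    if not results:
        return 0.0
    up_count = sum(1 for is_up, _ in results if is_up)
    return 100.0 * up_count / len(results)


# Made-up results from four probes, one of which failed.
results = [(True, 180.0), (True, 220.0), (False, 5000.0), (True, 240.0)]
print(availability(results))  # prints 75.0
```

In practice `probe` would be called on a timer from a machine outside the cloud environment being watched, so that an outage there does not also take down the check.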
In every case of monitoring, 100% uptime and fast page response are critical, according to Parnell. "If people aren't able to get to the website and see the advertising, then we're losing money."
The company also uses New Relic for application performance monitoring so the IT staff can get performance insights into which pages are running fast or slowly, memory consumption and CPU usage.
Watching in real time
The monitoring data arrives in real time on screens that his staffers are constantly watching, Parnell says.
The key, he notes, is to monitor with a wide range of products so you can get as much information as quickly as possible to make fixes when problems arise. "In general, I'd prefer to err on this side of too much data than not enough," he says. "New Relic does a great job of surfacing the important information in a dashboard so you don't have to wade through data. That helps when you want to take a quick look at what is going on."
To watch the performance in real time, Parnell's team uses several large monitors that constantly cycle through different reports so they can be watched all day by team members. "We aren't digging through this all day every day, but we do monitor for things [that] look out of the ordinary," Parnell explains. "All of these tools do give us deep data when we do need to dig deeper."
The monitor screens are watched mostly by a team of lead engineers, particularly when new features are being deployed or around times of high load.
Another major point to remember, Parnell says, is that cloud environments and cloud monitoring are still in their infancy. IT departments need to be flexible, finding and using cloud monitoring tools but still looking for new ones that could be even better, he says.
"We've only been using Scout for five or six months and it's working really well now, but in five months it could be something else" that does the job better, Parnell says. "You need to keep your finger on the pulse of the market so you can follow new tools. There are new companies popping up all the time."
Another thing to remember, he says, is that you have to constantly monitor the servers provided by your cloud vendor to be sure you always have the best-performing units.
Being able to swap out servers easily is one of the biggest benefits of using a cloud, Parnell says. "With a cloud, you can just ditch a slow server and get another one through your control panel."
The monitoring tools are also used in-house to improve the development of new website features aimed at Bleacher Report readers.
"If an engineer is deploying a new feature, I want them to be looking at performance and make sure that it is not adversely affecting performance elsewhere," Parnell says. "We continue to tune and refine everything within the system to be sure it is as fast as possible. If a big sports story breaks, we can have a big spike in traffic. Everything needs to scale, and we need to be able to handle that."
Know what you're getting -- and what to monitor
To get the performance your company really needs, you have to lay out your specific requirements to your cloud vendors, says analyst James Staten of Forrester Research.
"One of the first things that is important is transparency, which is 'What exactly is the performance that they are delivering to you?' " he says. That includes asking what levels of monitoring they allow you to do directly and what logs they will send you so you can see what's happening.
"If they aren't providing it, ask them for it," he says.
A huge part of your relationship with your cloud vendor is managing your expectations, Staten says. Any performance monitoring that you want to do is your responsibility, not your vendor's, he notes.
If you're not up for doing this kind of monitoring yourself, there are plenty of companies that you can hire to do it for you, Staten says, including HyperStratus, Keynote Systems, Hewlett-Packard, IBM, Accenture and others.
"A lot of people think their SLAs cover them for performance monitoring, and they don't," he says. "SLAs cover availability, but that's it."
At the same time, not all applications and services that your company runs on cloud networks will be mission-critical, he adds, so you might not have to monitor the performance of everything on the cloud. "You have to figure out what those [critical] applications are," Staten says.
End-to-end cloud management still a ways off
One final thing to consider, IDC's Turner says, is that the cloud performance-monitoring marketplace is still very immature.
There are many vendors "that will talk to you about that from a road map point of view, but there's very little there yet" in complete packages. "This year is still going to be strongly around automating the provisioning pieces" that will allow true end-to-end cloud monitoring, she says. "I think we'll see more sophistication as the year goes by."
As more companies begin transitioning to production environments in the cloud, the need for monitoring will become even more acute, she says. "I think this is going to be a priority investment area for many organizations this year," Turner predicts. "It will probably take another year or two out to get there due to the sophistication that's needed."
Of course, there's a catch-22 in all of this monitoring, Staten says. By the time you pay for monitoring to help ensure that you're getting the performance you've contracted for, you may have eroded the cost savings that got your company into the cloud in the first place.
"If you are spending a lot of money to deal with latency issues," Staten says, "then should you even be in the cloud?"
Todd R. Weiss is a former freelance writer and a newly minted senior writer at CIO.com, a sister site to Computerworld. Todd can be reached at email@example.com.
This story, "Managing your cloud's performance: Best practices," was originally published by Computerworld.