Netflix and Linux performance analysis in 60 seconds
Netflix runs a massive EC2 Linux cloud and relies on performance analysis tools to monitor it. In a recent blog post, Netflix explains how it investigates the performance of its Linux instances at the command line.
Brendan Gregg reports for Netflix:
You log in to a Linux server with a performance issue: what do you check in the first minute?
At Netflix we have a massive EC2 Linux cloud, and numerous performance analysis tools to monitor and investigate its performance. These include Atlas for cloud-wide monitoring, and Vector for on-demand instance analysis. While those tools help us solve most issues, we sometimes need to log in to an instance and run some standard Linux performance tools.
In this post, the Netflix Performance Engineering team will show you the first 60 seconds of an optimized performance investigation at the command line, using standard Linux tools you should have available.
In 60 seconds you can get a high-level idea of system resource usage and running processes by running the following ten commands. Look for errors and saturation metrics, as they are both easy to interpret, and then resource utilization. Saturation is where a resource has more load than it can handle, and can be exposed either as the length of a request queue, or time spent waiting.
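The ten commands, as listed in the original Netflix post, are uptime, dmesg | tail, vmstat 1, mpstat -P ALL 1, pidstat 1, iostat -xz 1, free -m, sar -n DEV 1, sar -n TCP,ETCP 1, and top. The sketch below wraps them in a small script. The function names first_60s_commands and run_first_60s are hypothetical. The sample count of 5 and the batch-mode top invocation are assumptions added here so that each command exits on its own; the post runs the interval tools without a count and stops them by hand.

```shell
#!/bin/sh
# The ten commands from the Netflix post, in order.
# Assumption: a count of 5 is appended to the interval-based tools, and top
# is run once in batch mode, so that nothing needs to be interrupted.
first_60s_commands() {
cat <<'EOF'
uptime
dmesg | tail
vmstat 1 5
mpstat -P ALL 1 5
pidstat 1 5
iostat -xz 1 5
free -m
sar -n DEV 1 5
sar -n TCP,ETCP 1 5
top -b -n 1 | head -n 20
EOF
}

# Run each command in turn, skipping any tool that is not installed.
# mpstat, pidstat, iostat and sar usually come from the sysstat package.
run_first_60s() {
  first_60s_commands | while IFS= read -r cmd; do
    tool=${cmd%% *}   # first word of the command line
    if command -v "$tool" >/dev/null 2>&1; then
      printf '\n== %s ==\n' "$cmd"
      sh -c "$cmd" </dev/null
    else
      printf '\n== %s (skipped: %s not found) ==\n' "$cmd" "$tool"
    fi
  done
}
```

Sourcing the file only defines the two functions. Calling run_first_60s prints each command's output under a header, which also leaves the results in the terminal scrollback. Some commands, such as dmesg, may need elevated privileges on some systems.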
The Netflix blog post caught the attention of Linux redditors and they shared their thoughts about it:
Psi: “If the box is not in an extremely laggy state on login, then the first thing that gets written is "htop".”
Michalf: “We always start with htop, top, atop to see the "big picture" first. At this point, in 90% of cases we already know where to go next. Only after that do we dig into details like iostat etc. nload is OK for a quick network traffic check too.”
ilikerackmounts: “Hah, you are the first and only person I know of that gives any credence to load averages out of uptime. I've read your books, I know why it's mildly useful in the first 10 seconds after login, but still I've never known anyone who went to this command as their first look as opposed to top.”
Sendmetohell: “Load averages are usually a good way of getting perspective. I use it rather than top because I want something succinct and on a single line.
I also wouldn't look at top for more reliable metrics, I'd be more inclined to look at sar -u or something. I care more about what it's been doing rather than what it happens to currently be doing.”
Decwakeboarder: “w - 5 less keystrokes than uptime and more information. The first thing I check on a box is to make sure I'm not going to undo work another admin is doing or vice versa.”
Chaporouge: “I cannot believe I've never seen that command. I mean, I'm no veteran, just a huge fan of unix-like systems, never seen it. Really useful, thanks!”
Brendangregg: “It's partly habit, and I just want it on a line, and I want it in my scrollback buffer in case the server vanishes as I'm debugging (and top's output usually isn't there). I've done much post-incident documentation based on scrollback, as either the server is gone or the issue went away.
'w' is ok too. I hope I'm not the only person who uses uptime for load averages; I'm reminded of the Cuckoo's Egg, where the cracker had a distinctive usage of 'ls'!”