Windows 7 on multicore: How much faster?
Microsoft has been touting its superior handling of threads in Windows 7. InfoWorld's tests show that speed isn't the only benefit, or necessarily the main one
These results suggest that when considering Windows 7, performance should be viewed as a reasonable justification for upgrading from Windows XP, but not a driver for migration from Vista. (There are unrelated reasons to upgrade from Vista, as discussed below and in other articles on InfoWorld.) The flat performance results against Vista are reasonable given that, as we noted earlier, Windows 7 is based on the Vista kernel.
What might be surprising is that Windows 7's multithreading changes did not deliver more of a performance punch. The explanation for this lies in what changed in how Windows 7 manages threads. The principal changes consist of increased processor affinity (see the sidebar, "How Nehalem processors and Windows 7 work together") and changes to the Windows kernel dispatcher lock. This eye-glazing term refers to a core aspect of modern operating systems: how the kernel prevents two threads from accessing the same data or resource at the same time.
Anytime a thread wants to access an item that might be claimed by another thread, it must use a lock to make sure that only one thread at a time can modify the item. Prior to Windows 7, when a thread needed to get or access a lock, its request had to go through a global locking mechanism. This mechanism -- the kernel dispatcher lock -- would handle the requests. Because it was unique and global, it handled potentially thousands of requests from all processors on which Windows ran. As a result, this dispatcher lock was becoming a major bottleneck. In fact, it was a principal gating factor that kept Windows Server from running on more than 64 processors.
New locking mechanism
Windows 7 includes a wholly new mechanism that gets rid of the global locking concept and pushes the management of lock access down to the locked resources. This permits Windows 7 to scale up to 256 processors without performance penalty. On systems with only a few processors, however, the old kernel dispatcher lock was not overburdened, so this new mechanism provides no noticeable improvement in threading performance on desktops and small servers.
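The difference between the two locking designs can be illustrated with a small user-space analogy. The sketch below is not Windows kernel code, and the class and function names (GlobalLockCounters, PerResourceLockCounters, run_workers) are invented for this illustration. The first class funnels every update through one shared lock, loosely mirroring a single global dispatcher lock; the second gives each resource its own lock, so threads working on different resources never wait on each other.

```python
import threading


class GlobalLockCounters:
    """All counters share one lock (analogy for a single global lock)."""

    def __init__(self, n):
        self._lock = threading.Lock()
        self._values = [0] * n

    def increment(self, i):
        # Every thread contends for the same lock, whichever counter it touches.
        with self._lock:
            self._values[i] += 1

    def values(self):
        with self._lock:
            return list(self._values)


class PerResourceLockCounters:
    """Each counter carries its own lock (analogy for per-resource locking)."""

    def __init__(self, n):
        self._locks = [threading.Lock() for _ in range(n)]
        self._values = [0] * n

    def increment(self, i):
        # Only threads touching counter i contend for lock i.
        with self._locks[i]:
            self._values[i] += 1

    def values(self):
        result = []
        for i, lock in enumerate(self._locks):
            with lock:
                result.append(self._values[i])
        return result


def run_workers(counters, n_threads, iterations):
    """Start n_threads threads; thread t increments counter t `iterations` times."""

    def work(index):
        for _ in range(iterations):
            counters.increment(index)

    threads = [threading.Thread(target=work, args=(t,)) for t in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counters.values()


if __name__ == "__main__":
    print(run_workers(GlobalLockCounters(4), 4, 1000))
    print(run_workers(PerResourceLockCounters(4), 4, 1000))
```

Both versions produce the same counts; what differs is how many threads queue behind a given lock. With only a handful of threads the shared lock is rarely a bottleneck, which is consistent with the flat desktop results described above.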
The improved processor affinity discussed in the sidebar does not show up in the performance results. On runs with SMT disabled, this was expected: the benchmarks use all available resources, so no Turbo Boost is possible. When we ran the four-thread Viewperf benchmark with SMT enabled (giving the benchmark eight processing pipelines), the results were essentially unchanged. The differences were immaterial, which suggests that Turbo Boost works best in narrowly constrained settings rather than in typical threaded applications such as the ones we tested. Despite several requests, Microsoft would not comment on these results.
The Cinebench result is a ratio that measures how much faster the benchmark runs with multiple threads than with one thread; it's a true measure of how threading scales when measured by rendering performance. Cinebench showed negligible differences in performance across the three operating systems, both with SMT disabled and with SMT enabled. However, unlike with Viewperf, the results for all three versions of Windows were distinctly better with SMT enabled; that is, Cinebench rendering ran nearly 20 percent faster on eight threads (SMT on) than on four (SMT off), regardless of the version of Windows. This divergence between the two benchmarks regarding SMT's benefit underscores the need to test its effect on your existing applications before deciding whether to enable it.
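For readers who want to run the same comparison on their own applications, the arithmetic behind a scaling ratio and an SMT gain is simple. The sketch below is only an illustration: the function names are ours, and the timings in the usage lines are invented placeholders, not InfoWorld's measurements.

```python
def scaling_ratio(single_thread_seconds, multi_thread_seconds):
    """Return how many times faster the multithreaded run is than the single-threaded run."""
    if single_thread_seconds <= 0 or multi_thread_seconds <= 0:
        raise ValueError("timings must be positive")
    return single_thread_seconds / multi_thread_seconds


def percent_faster(baseline_seconds, new_seconds):
    """Return the speed gain of the new run over the baseline, in percent."""
    if baseline_seconds <= 0 or new_seconds <= 0:
        raise ValueError("timings must be positive")
    return (baseline_seconds / new_seconds - 1.0) * 100.0


if __name__ == "__main__":
    # Placeholder timings in seconds; substitute your own measurements.
    one_thread = 120.0
    four_threads_smt_off = 32.0
    eight_threads_smt_on = 27.0

    print("Scaling, SMT off:", scaling_ratio(one_thread, four_threads_smt_off))
    print("Scaling, SMT on: ", scaling_ratio(one_thread, eight_threads_smt_on))
    print("SMT gain (%):    ", percent_faster(four_threads_smt_off, eight_threads_smt_on))
```

A gain near zero on your workload would resemble our Viewperf result; a gain in the high teens would resemble Cinebench.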