The Power5’s Level 2 cache totals just under 2MB and is shared by both cores. With a shared cache, data fetched by one core is immediately available
to the other, increasing the likelihood that fetching the next program instruction or block of data won’t require a trip to
performance-killing RAM. But the shared cache also makes it more likely that the cores will try to access the cache at the
same time, which they cannot do.
IBM implemented a cache-contention stopgap, splitting the Level 2 cache into three segments. This design permits quasi-simultaneous
access to the cache as long as the two cores are hitting different segments. IBM has another creative solution to the Level
2 cache-contention issue: a hefty 36MB external Level 3 cache. Each core owns its Level 3 cache exclusively, so there’s
no possibility of conflict between cores. Although Level 3 cache isn’t nearly as fast as Level 2, it is much faster
than main memory, and Power5’s design connects each core to its associated Level 3 cache through a direct link.
We consider IBM’s reworking of the Level 3 cache design to be one of the top design wins in Power5.
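To make the segmented Level 2 design concrete, here is a minimal Python sketch of the idea: two same-cycle accesses collide only when they land in the same segment. The three-segment count comes from the description above; the 128-byte line size, the modulo mapping, and the function names are our own illustrative assumptions, not IBM's actual addressing scheme.

```python
# Toy model of a three-segment shared cache.
# LINE_BYTES and the modulo mapping are illustrative assumptions,
# not IBM's real address-to-segment hashing.

NUM_SEGMENTS = 3    # from the article: the Level 2 cache has three segments
LINE_BYTES = 128    # assumed cache-line size, for illustration only


def segment_for(address: int) -> int:
    """Map a byte address to a segment index (hypothetical modulo mapping)."""
    line_number = address // LINE_BYTES
    return line_number % NUM_SEGMENTS


def contends(addr_core0: int, addr_core1: int) -> bool:
    """Two same-cycle accesses conflict only if they target the same segment."""
    return segment_for(addr_core0) == segment_for(addr_core1)


def contention_rate(pairs) -> float:
    """Fraction of same-cycle access pairs that hit the same segment."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(contends(a, b) for a, b in pairs) / len(pairs)
```

Under this toy mapping, two cores walking through unrelated data should collide on roughly one access pair in three, which is the point of calling the split a stopgap rather than a cure.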
On-chip memory controllers are another substantial Power5 gain. Each Power5 core has its own controller and is capable
of managing a dedicated block of main memory. This has a huge impact on overall performance, as we’ve seen in comparing the
memory throughput of Opteron, which has an on-chip controller, and Xeon, which does not. And in Power5’s case, the design
fits with IBM’s strategy of multilevel parallelization.
Two is not enough
Power5 isn’t just dual-core; it also implements SMT (Simultaneous Multi-Threading), which gives each core the
capability of executing instructions from two threads simultaneously, under certain conditions. SMT is similar to Intel’s
HTT (Hyper-Threading Technology) but with distinct advantages: The “certain conditions” are broader, and Power5 dynamically
optimizes parallelization by analyzing and prioritizing threads to make parallel execution more efficient. Much more efficient,
we think. Although it’s difficult to isolate in testing, Power5’s implementation should outgun the maximum 30 percent boost
that Intel projects for HTT.
Power5 adds two basic, but much-needed, thread-prioritization schemes. Dynamic Resource Balancing attempts to keep instruction
streams flowing smoothly by analyzing the behavior of threads and by sidelining code that could slow down an SMT stream. For
example, instructions that must be executed in sequence to derive an accurate result can lock that thread in the processor
for a time. Power5 tries to predict this and run simpler instructions until there’s room to execute the sequence without clogging
SMT.
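The balancing idea can be sketched in a few lines of Python. This is a toy model under our own assumptions, not IBM's algorithm: the pool size, the throttle threshold, and the function name `choose_thread` are all made up for illustration. A thread that is hogging a shared pool of in-flight instructions is passed over so the other thread keeps flowing.

```python
# Toy model of resource balancing between two hardware threads.
# SHARED_SLOTS and THROTTLE_SHARE are invented numbers, not Power5 parameters.

SHARED_SLOTS = 20       # hypothetical size of a shared in-flight instruction pool
THROTTLE_SHARE = 0.75   # hypothetical: holding more than this share gets a thread sidelined


def choose_thread(in_flight, last_chosen):
    """Pick which thread (0 or 1) may issue next.

    in_flight[i] is how many shared slots thread i currently holds.
    A thread hogging the pool is passed over; otherwise the threads alternate.
    """
    limit = SHARED_SLOTS * THROTTLE_SHARE
    hogging = [count > limit for count in in_flight]
    if hogging[0] and not hogging[1]:
        return 1
    if hogging[1] and not hogging[0]:
        return 0
    return 1 - last_chosen  # neither (or both) hogging: simple alternation
```

For example, `choose_thread([16, 2], last_chosen=1)` returns 1: thread 0 holds 16 of 20 slots, which is over the assumed limit of 15, so thread 1 goes again even though it just had its turn.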
In another awesome design gain, Power5’s adjustable thread priority gives OSes, drivers, and applications the capability of
assigning an arbitrary priority level to each thread. This application-defined thread priority is factored into Dynamic Resource
Balancing calculations and is used more broadly to determine the length of time a thread remains active in the CPU. It also
gives operating systems an easy way to manage power consumption.
If you’ve got a lot of high-priority threads running, the box will run hot. But as the OS knocks thread priorities down, the
CPU will run more idle cycles and therefore run cooler. If you knock all thread priorities down to their lowest level, the
CPU goes into a sleeplike low-power mode. That’s the simplest approach to power management we can imagine.
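A rough Python sketch shows how priority could translate into busy and idle cycles. Everything numeric here is an assumption for illustration: the 1-to-7 priority range, the proportional split, and the function name `plan_cycles` are ours, not Power5's documented behavior. The only rules taken from the text are that lower priorities mean more idle cycles and that all-lowest priorities mean a low-power mode.

```python
# Toy model: thread priorities decide how many cycles do useful work.
# The priority range and the proportional split are illustrative assumptions.

LOWEST = 1    # hypothetical lowest priority level
HIGHEST = 7   # hypothetical highest priority level


def plan_cycles(priorities, total_cycles=100):
    """Return (cycles_per_thread, idle_cycles, low_power) for one time window."""
    if all(p == LOWEST for p in priorities):
        # Every thread at the lowest level: model the sleeplike low-power mode.
        return [0] * len(priorities), total_cycles, True

    # The busiest thread sets how much of the window is spent working at all.
    busy = total_cycles * max(priorities) // HIGHEST
    # Busy cycles are split in proportion to priority (integer division).
    weight = sum(priorities)
    shares = [busy * p // weight for p in priorities]
    idle = total_cycles - sum(shares)
    return shares, idle, False
```

With both threads at the top level, `plan_cycles([7, 7])` leaves no idle cycles; knock them down to `[4, 2]` and a large part of the window goes idle, which is the run-cooler effect described above.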