How to speed your code using CPU caches

Techniques include using cache-friendly structs provides a huge performance gain

How to speed your code using CPU caches
Thinkstock

The CPU’s cache reduces memory latency when data is accessed from the main system memory. Developers can and should take advantage of CPU cache to improve application performance.

How CPU caches work

Modern CPUs typically have three levels of cache, labeled L1, L2, and L3, which reflects the order in which the CPU checks them. CPUs often have a data cache, an instruction cache (for code), and a unified cache (for anything). Accessing these caches are much faster than accessing the RAM: Typically, the L1 cache is about 100 times faster than the RAM for data access, and the L2 cache is 25 times faster than RAM for data access.

When your software runs and needs to pull in data or instructions, the CPU caches are checked first, then the slower system RAM, and finally the much slower disk drives. That’s why you want to optimize your code to seek what is likely to be needed from the CPU cache first.

Your code can’t specify where data instructions and data reside—the computer hardware does that—so you can't force certain elements into the CPU cache. But you can optimize your code to retrieve the size of the L1, L2, or L3 cache in your system using Windows Management Instrumentation (WMI) to optimize when your application accesses the cache and thus its performance.

CPUs never access the cache byte by byte. Instead, they read the memory in cache lines, which are chunks of memory generally 32, 64, or 128 bytes in size.

The following code listing illustrates how you can retrieve the L2 or L3 CPU cache size in your system:

public static uint GetCPUCacheSize(string cacheType)
{
     try
     {
          using (ManagementObject managementObject = new ManagementObject("Win32_Processor.DeviceID='CPU0'"))
          {
               return (uint)(managementObject[cacheType]);
          }
     }
     catch
          {
               return 0;
          }
}
static void Main(string[] args)
     {
     uint L2CacheSize = GetCPUCacheSize("L2CacheSize");
     uint L3CacheSize = GetCPUCacheSize("L3CacheSize");
     Console.WriteLine("L2CacheSize: " + L2CacheSize.ToString());
     Console.WriteLine("L3CacheSize: " + L3CacheSize.ToString());
     Console.Read();
}

Microsoft has additional documentation on the Win32_Processor WMI class.

Programming for performance: Example code

When you have objects in the stack, there isn’t any garbage collection overhead. If you are using heap-based objects, there is always a cost involved with the generational garbage collection for collecting or moving objects in the heap or compacting the heap memory. A good way to avoid garbage collection overhead is to use structs instead of classes.

Caches work best if you are using a sequential data structure, such as an array. Sequential ordering lets the CPU can read ahead and also read ahead speculatively in anticipation of what is likely to be requested next. Thus, an algorithm that accesses the memory sequentially is always fast.

If you access memory in a random order, the CPU needs new cache lines every time you access memory. That reduces performance.

The following code snippet implements a simple program that illustrates the benefits of using a struct over a class:

 struct RectangleStruct    
{
     public int breadth;
     public int height;
}
class RectangleClass
{
     public int breadth;
     public int height;
}

The following code profiles the performance of using an array of structs against an array of classes. For illustration purposes, I’ve used a million objects for both, but typically you don’t need that many objects in your application.

static void Main(string[] args)
{
     const int size = 1000000;
     var structs = new RectangleStruct[size];
     var classes = new RectangleClass[size];
     var sw = new Stopwatch();
     sw.Start();
     for (var i = 0; i < size; ++i)
     {
          structs[i] = new RectangleStruct();
          structs[i].breadth = 0
          structs[i].height = 0;
     }
var structTime = sw.ElapsedMilliseconds;
sw.Reset();
sw.Start();
for (var i = 0; i < size; ++i)
     {
          classes[i] = new RectangleClass();
          classes[i].breadth = 0;
          classes[i].height = 0;
     }
     var classTime = sw.ElapsedMilliseconds;
     sw.Stop();
     Console.WriteLine("Time taken by array of classes: "+ classTime.ToString() + " milliseconds.");
     Console.WriteLine("Time taken by array of structs: " + structTime.ToString() + " milliseconds.");
     Console.Read();
}

The program is simple: It creates 1 million objects of structs and stores them in an array. It also creates 1 million objects of a class and stores them in another array. The properties' breadth and height are assigned a value of zero on each instance.

As you can see, using cache-friendly structs provides a huge performance gain.

Rules of thumb for better CPU cache usage

So, how do you write code that best uses the CPU cache? Unfortunately, there is no magic formula. But there are some rules of thumb:

  • Avoid using algorithms and data structures that exhibit irregular memory access patterns; use linear data structures instead.
  • Use smaller data types and organize the data so there aren't any alignment holes.
  • Consider the access patterns and take advantage of linear data structures.
  • Improve spatial locality, which uses each cache line to the maximum extent once it has been mapped to a cache.