

Alternative Data Layouts for Performance: Intel MKL with Compact Matrix Functions


The impact of data layout on performance can be significant. One opportunity to realize better performance lies in an important new capability of the latest version of the Intel® Math Kernel Library (Intel MKL). This feature offers significant speed-ups for processing small matrices, even when the overhead of data format conversions is taken into account.

There’s really nothing sacred about the “standard layouts” for data other than their legacy. They simply weren’t designed with caches and SIMD (single instruction, multiple data) instructions in mind. It’s therefore interesting to see that we can get speed-ups with simple code changes that switch to a compact data layout, even when only a small portion of an application uses it.

Understanding why data layout matters

Are you familiar with the question, “Is it a bug or a feature?” Data layout inspires this question frequently because application performance can be highly dependent on the layout of data in memory. It’s been this way for a few decades now, but it hasn’t always been this way. What changed? As more transistors became available to silicon designers, they added clever tricks to speed up common operations. Several of these tricks are based on the observation that applications that access a memory location tend to access memory locations that immediately follow in memory. Additionally, such applications tend to reuse the data from those memory locations a few times before being done with it.

These two observations led to caching: holding recently used data in memory that is closer to the computational core of a processor and faster for the core to access than main memory. Another trick was to grab and hold onto many bytes of memory for every single memory access that was requested. This led to the concept of a cache line, which is commonly 64 bytes long. If any byte in a cache line is read, the entire cache line is fetched from memory and held in cache. This greatly accelerates accesses to the other bytes as long as the data remains in cache. Caches evict data, in favor of holding more recently used data, in roughly “this data hasn’t been used for the longest time” ordering (“least recently used” in computer jargon).

Additionally, some instructions exist to work on cache lines, or large parts of them, in parallel. These vector instructions (also called SIMD instructions) operate only on data that sits sequentially in memory, typically within a cache line. Like caching, these instructions speed up code whose data is laid out sequentially in memory and do nothing for other data layouts.

These speed-ups are clearly handy features, but the dark side is that random data access can be much slower than sequential data access. It’s not that the design slowed down random accesses; it’s that the design sped up only sequential ones. Hence the question: Is it a bug or a feature? Let’s say it’s a feature that encourages us to rearrange our programs to make use of these capabilities. When we don’t, we get less performance than the machine is capable of delivering.
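To make this concrete, here is a minimal C sketch (the array size and stride are illustrative choices, not taken from the article) that sums the same array twice: once walking memory sequentially, and once jumping a full cache line between accesses. Timing the two loops on typical hardware shows the sequential version running far faster, because every fetched cache line is fully used and the compiler can vectorize the loop.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 24)   /* number of doubles (~128 MB); illustrative size */
#define STRIDE 8      /* 8 doubles = 64 bytes, one cache line per access */

/* Sequential walk: every byte of each fetched cache line is used,
   and the loop is easy for the compiler to vectorize with SIMD. */
static double sum_sequential(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

/* Strided walk: touches the same elements overall, but each access lands
   on a different cache line, so most of every fetched line is wasted. */
static double sum_strided(const double *a, size_t n, size_t stride) {
    double s = 0.0;
    for (size_t start = 0; start < stride; ++start)
        for (size_t i = start; i < n; i += stride)
            s += a[i];
    return s;
}

int main(void) {
    double *a = malloc(N * sizeof *a);
    if (!a) return 1;
    for (size_t i = 0; i < N; ++i) a[i] = 1.0;

    /* Both calls do the same arithmetic; wrap them in a timer
       (e.g., clock_gettime) to see the access-pattern effect. */
    printf("sequential: %f\n", sum_sequential(a, N));
    printf("strided:    %f\n", sum_strided(a, N, STRIDE));
    free(a);
    return 0;
}
```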

‘Compact data layouts’

Many high-performance computing applications depend on matrix computations performed on large groups of very small matrices. The latest version of Intel Math Kernel Library provides new compact functions that include vectorization-based optimizations for problems of this type.

Reorganizing data into a “compact data layout” allows Intel MKL to use true SIMD matrix computations, which is where the significant performance benefits of these compact formats come from. With the right planning, the precise data format used in an application should not affect much of the application code, and it can be selected through conditional compilation or at run time.
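To see what a compact layout means in terms of memory addresses, here is a rough, hypothetical C sketch. It is not the Intel MKL compact format itself (the library chooses a SIMD-friendly packing internally); it only illustrates the idea of interleaving the matrices in a group, and how an accessor function can hide the choice of layout from the rest of the application.

```c
#include <stddef.h>

/* A group of nm small m x n matrices stored in one contiguous buffer.
   (Hypothetical illustration only; the real Intel MKL compact format is
   selected by the library and documented separately.) */
enum layout { LAYOUT_STANDARD, LAYOUT_COMPACT };

/* Standard layout: the matrices sit one after another, each column-major.
   Element (i,j) of matrix k lives at offset k*m*n + j*m + i. */
static size_t idx_standard(size_t i, size_t j, size_t k, size_t m, size_t n) {
    return k * m * n + j * m + i;
}

/* Compact (interleaved) layout: for each position (i,j), the values from
   all nm matrices are adjacent, so one SIMD instruction can process
   element (i,j) of several matrices at once.
   Element (i,j) of matrix k lives at offset (j*m + i)*nm + k. */
static size_t idx_compact(size_t i, size_t j, size_t k, size_t m, size_t nm) {
    return (j * m + i) * nm + k;
}

/* An accessor like this lets the bulk of the application code stay
   layout-agnostic; the layout can be picked at run time or via #ifdef. */
static double get(const double *buf, enum layout lay,
                  size_t i, size_t j, size_t k,
                  size_t m, size_t n, size_t nm) {
    return (lay == LAYOUT_COMPACT)
        ? buf[idx_compact(i, j, k, m, nm)]
        : buf[idx_standard(i, j, k, m, n)];
}
```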

Intel MKL (2018 and later) has seven compact functions:

  1. General matrix-matrix multiply
  2. Triangular matrix equation solve
  3. LU factorization (without pivoting)
  4. Inverse calculation (from LU without pivoting)
  5. Cholesky factorization
  6. QR factorization
  7. Service functions to facilitate the packing and unpacking of groups of matrices into the compact data layout

Dramatic speed-ups

Operations on very small matrices can run up to 20X faster, while not-quite-as-small matrices see around a 2X improvement, assuming we use the compact data layout. There is overhead to convert to and from the compact layout, but even with that overhead we can see significant application speed-ups. The more processing we do while the data is in the compact layout, the faster our code will run.

Intel, in a recent article, showed the results of a program that calculates the inverse from a non-pivoting LU factorization. The following figure shows the speed-up obtained by changing the program to convert to the compact format, do the calculations, and then convert the answer back to a standard format. Even with the conversion overhead included, the compact approach still provides consistently good speed-ups, with some cases showing up to 4X over calls to Intel MKL that use standard data layouts.


Speed-up comparing apples to apples (including conversion costs). Source: Intel
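For readers who want to see the shape of that workflow in code, here is a hedged sketch of the pack, factor, invert, and unpack sequence using the compact functions introduced in Intel MKL 2018. The function names and argument order follow that API as best I recall it (and the Intel article linked below); check mkl.h and the Intel MKL documentation for the exact prototypes and workspace requirements before relying on it.

```c
/* A hedged sketch of the pack -> factor -> invert -> unpack flow, using the
   compact functions added in Intel MKL 2018. Names and argument order are
   written from memory of that API; verify against mkl.h and the Intel MKL
   documentation before use. */
#include <mkl.h>

/* a: nm pointers to n x n column-major matrices, each with leading dimension lda. */
void invert_group_compact(double *const *a, MKL_INT n, MKL_INT lda, MKL_INT nm)
{
    MKL_INT info = 0;

    /* Ask MKL which compact packing suits this CPU, and how large the
       packed buffer needs to be for this group of matrices. */
    MKL_COMPACT_PACK fmt = mkl_get_format_compact();
    MKL_INT buf_size = mkl_dget_size_compact(lda, n, fmt, nm);

    double *ap   = (double *)mkl_malloc(buf_size, 64);  /* packed matrices */
    double *work = (double *)mkl_malloc(buf_size, 64);  /* workspace; assumed the
                                                            same size here, check the docs */

    /* 1. Convert the group of matrices into the compact layout. */
    mkl_dgepack_compact(MKL_COL_MAJOR, n, n, (const double *const *)a, lda,
                        ap, lda, fmt, nm);

    /* 2. LU factorization without pivoting, applied to all nm matrices at once. */
    mkl_dgetrfnp_compact(MKL_COL_MAJOR, n, n, ap, lda, &info, fmt, nm);

    /* 3. Inverse computed from the non-pivoting LU factors. */
    mkl_dgetrinp_compact(MKL_COL_MAJOR, n, ap, lda, work, buf_size, &info, fmt, nm);

    /* 4. Convert the results back to the standard layout. */
    mkl_dgeunpack_compact(MKL_COL_MAJOR, n, n, a, lda, ap, lda, fmt, nm);

    mkl_free(work);
    mkl_free(ap);
}
```

The point to notice is that the numerical calls in the middle do all the real work; the two conversion calls at either end are the entire cost of adopting the compact layout.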

Small matrices benefit from compact layouts

When processing small matrices, it’s well worth looking at Intel’s innovative compact data layout functions. Code changes are limited to changing the name of Intel MKL functions to the compact layout versions and inserting two conversion calls. The resulting speed-ups can be quite significant, because data layouts matter.

Resources:

For more in-depth coverage of this Intel MKL feature, see Speeding Algebra Computations with Intel® Math Kernel Library Vectorized Compact Matrix Functions

Download Intel Math Kernel Library for free today
