SIMD Intrinsics Aren’t So Scary, but Should We Use Them?


Is low-level programming a sin or a virtue? It depends.

When programming to use vector processing on a modern processor, ideally I’d write some code in my favorite language and it would run as fast as possible “auto-magically.”

Unless you just started programming last week, I suspect you know that’s not how the world works. Top performance only comes with effort. Hence my question: how low should we go?

Vector operations defined

A “vector” operation is a math operation that performs several operations at once. Where a regular add instruction adds a single pair of numbers, a vector add might add eight pairs. Consider asking the computer to add eight pairs of numbers to each other (compute C1=A1+B1, C2=A2+B2, … C8=A8+B8): instead of eight regular add instructions, we can do it with a single vector add instruction.

Vector instructions include addition, subtraction, multiplication, and other operations.

SIMD: parallelism for vectors

Computer scientists have a fancy name for vector instructions: SIMD, or “Single Instruction, Multiple Data.” If we think of a regular add instruction as SISD (Single Instruction, Single Data), where “single” means a single pair of data inputs, then a vector add is SIMD, where “multiple” could mean eight pairs of data inputs.

I like to call SIMD “the other hardware parallelism,” since “parallelism” in computers is so often thought of as coming from having multiple cores. Core counts have steadily increased: four cores are common, 20 or more are common in server processors, and Intel’s top core count today is 72 cores in a single Intel® Xeon Phi™ processor.

Vector instruction sizes have risen, too. Early vector instructions, such as SSE, performed up to four operations at a time. Intel’s widest vector instructions today, in AVX-512, perform up to 16 operations at a time.

How low should we go?

With so much performance at stake, how much work should we do to exploit this performance?

The answer is a lot, and here’s why: four cores can get us at most a 4X speed-up. AVX (half the width of AVX-512, but much more common) can get us at most an 8X speed-up. Combined, they can get up to 32X. Doing both makes a lot of sense.

Here’s my simple list of how to try to exploit vector instructions (in the order we should try to apply them):

1. First, call a library that does the work (the ultimate in implicit vectorization). An example of such a library is the Intel® Math Kernel Library (Intel® MKL). All the work to use vector instructions was done by someone else. The limitation is obvious: we have to find a library that does what we need.

2. Second, use implicit vectorization. Stay abstract and write it yourself using templates or compilers to help. Many compilers have vectorization switches and options. Compilers are likely to be the most portable and stable way to go. There have been many templates for vectorization, but none has seen enough usage over time to be a clear winner (a recent entry is Intel® SIMD Data Layout Templates [Intel® SDLT]).

3. Third, use explicit vectorization. This approach has become very popular in recent years; it tries to stay abstract while forcing the compiler to use vector instructions when it would not otherwise use them. The support for SIMD in OpenMP is the key example here, where vectorization requests are given to the compiler very explicitly. Non-standard extensions exist in many compilers, often in the form of options or “pragmas.” If you take this route, OpenMP is the way to go if you are in C, C++, or Fortran.

4. Finally, get low and dirty: use SIMD intrinsics. It’s like assembly language, but written inside your C/C++ program. SIMD intrinsics actually look like function calls, but generally produce a single instruction (a vector operation instruction, also known as a SIMD instruction).

SIMD intrinsics aren’t evil; they are simply a last resort. The first three choices, when they work, are always more maintainable for the future. When they fail to meet our needs, though, we definitely should try using SIMD intrinsics.

If you want to get started using SIMD intrinsics, you’ll have a serious leg up if you’re used to assembly language programming. Mostly this is because you’ll have an easier time reading the documentation that explains the operations, including Intel’s excellent online “Intrinsics Guide.” If you’re completely new to this, I ran across a recent blog (“SSE: mind the gap!”) that offers a gentle introduction to intrinsics. I also like “Crunching Numbers with AVX and AVX2.”

If a library or compiler can do what you need, SIMD intrinsics aren’t the best choice. However, they have their place and they aren’t hard to use once you get used to them. Give them a try. The performance benefits can be amazing. I’ve seen SIMD intrinsics used by clever programmers for code that no compiler is likely to produce.

Even if we try SIMD intrinsics, and eventually let a library or compiler do the work, what we learn can be invaluable in understanding the best use of a library or compiler for vectorization. And that may be the best reason to try SIMD intrinsics the next time we need something to use vector instructions.
