sponsored

Why Effective Parallel Programming Must Include Scalable Memory Allocation

Credit: iStock/photoart23D

Multicore processor? Yes.

Write program to run in parallel? Yes.

Did you remember to use a Scalable Memory Allocator? No? Then read on …

In my experience, making sure memory allocation is ready for parallelism is an often-overlooked element of getting a parallel program to perform well. I can show you an incredibly easy way to see whether this is a problem for a compiled program (C, C++, Fortran, etc.) – as well as how to fix it.

A critical part of any parallel program is scalable memory allocation, which includes use of new as well as explicit calls to malloc, calloc, or realloc. Options include TBBmalloc (Intel Threading Building Blocks), jemalloc, and tcmalloc. TBBmalloc has a novel “proxy” feature that makes it easy to try in less than 5 minutes on any compiled program.

The performance benefits of using a scalable memory allocator are significant. TBBmalloc was among the first widely used scalable memory allocators, in no small part because it came free with TBB to help highlight the importance of including memory allocation considerations in any parallel program.  It remains extremely popular today, and is still one of the best scalable memory allocators available.

An easy solution without any code changes

Using the proxy methods, we can globally replace new/delete and malloc/calloc/realloc/free/etc. routines with a dynamic memory interface replacement technique. This automatic way to replace the default functions for dynamic memory allocation is by far the most popular way to use TBBmalloc. It’s easy and sufficient for most programs.

The details of the mechanism used on each operating system vary a bit, but the net effect is the same everywhere.

We start our 5-minute trial by downloading and installing Threading Building Blocks (free from http://threadingbuildingblocks.org; it’s also included as part of Intel Parallel Studio products).

Use proxy on Linux

On Linux, we can do the replacement either by loading the proxy library at program load time using the LD_PRELOAD environment variable (without changing the executable file), or by linking the main executable file with the proxy library (-ltbbmalloc_proxy). The Linux program loader must be able to find the proxy library and the scalable memory allocator library at program load time. For that we may include the directory containing the libraries in the LD_LIBRARY_PATH environment variable or add it to /etc/ld.so.conf.

Try as follows:

time ./a.out (or whatever our program is called)

export LD_PRELOAD=libtbbmalloc_proxy.so.2

time ./a.out (or whatever our program is called)
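The link-time alternative mentioned above can be sketched as follows; the compiler choice, program name, and library paths here are my assumptions – adjust them for your environment:

```shell
# Link the proxy in at build time instead of using LD_PRELOAD.
# $TBBROOT is assumed to point at the TBB installation directory.
g++ -O2 myprog.cpp -o myprog \
    -L$TBBROOT/lib -ltbbmalloc_proxy -ltbbmalloc

# Make sure the loader can find the libraries at run time:
export LD_LIBRARY_PATH=$TBBROOT/lib:$LD_LIBRARY_PATH
time ./myprog
```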

Use proxy on macOS

On macOS, we can do the replacement either by loading the proxy library at program load time using the DYLD_INSERT_LIBRARIES environment variable (without changing the executable file), or by linking the main executable file with the proxy library (-ltbbmalloc_proxy). The macOS program loader must be able to find the proxy library and the scalable memory allocator library at program load time. For that, we may include the directory containing the libraries in the DYLD_LIBRARY_PATH environment variable.

Try as follows:

time ./a.out (or whatever our program is called)

export DYLD_INSERT_LIBRARIES=$TBBROOT/lib/libtbbmalloc_proxy.dylib

time ./a.out (or whatever our program is called)

Use proxy on Windows

On Windows, we must modify our executable. We can either force the proxy library to be loaded by adding #include "tbb/tbbmalloc_proxy.h" to our source code, or by using certain linker options when building the executable:

For win32:

            tbbmalloc_proxy.lib /INCLUDE:"___TBB_malloc_proxy"

For win64:

            tbbmalloc_proxy.lib /INCLUDE:"__TBB_malloc_proxy"

The Windows program loader must be able to find the proxy library and the scalable memory allocator library at program load time. For that, we may include the directory containing the libraries in the PATH environment variable. Try it out by using the Visual Studio “Performance Profiler” to time the program with and without the include or link option.

Testing our proxy library usage with a small program

I encourage you to try this with your own program as described above. Run with and without the proxy, and see how much benefit your application gets. Applications with lots of parallelism and lots of memory allocations often see 10-20% boosts (I've seen a 400% boost once, too), while programs with little parallelism or few allocations may see no effect at all. The quick tests with the proxy library described previously will tell you which category your application is in.

I’ve also written a short program to illustrate the effects as well as to provide an easy way to check that things are installed and working as expected. We can try the proxy library with a simple program:

#include <stdio.h>
#include "tbb/tbb.h"

using namespace tbb;

const int N = 1000000;

int main() {
    double *a[N];
    parallel_for( 0, N, [&](int i) { a[i] = new double; } );
    parallel_for( 0, N, [&](int i) { delete a[i];       } );
    return 0;
}

My example program does use a lot of stack space, so "ulimit -s unlimited" (Linux/macOS) or "/STACK:10000000" (Visual Studio: Properties > Configuration Properties > Linker > System > Stack Reserve Size) will be important to avoid immediate crashes.

After compiling, here are the various ways I ran my little program to see the speed with and without the proxy library.

Running and timing tbb_mem.cpp on a quadcore virtual Linux machine, I saw the following:

% time ./tbb_mem 

real       0m0.160s

user       0m0.072s

sys        0m0.048s

%

% export LD_PRELOAD=$TBBROOT/lib/libtbbmalloc_proxy.so.2

%

% time ./tbb_mem 

real       0m0.043s

user       0m0.048s

sys        0m0.028s

Running and timing tbb_mem.cpp on a quadcore iMac (macOS), I saw the following:

% time ./tbb_mem

real       0m0.046s

user       0m0.078s

sys        0m0.053s

%

% export DYLD_INSERT_LIBRARIES=$TBBROOT/lib/libtbbmalloc_proxy.dylib

%

% time ./tbb_mem 

real       0m0.019s

user       0m0.032s

sys        0m0.009s

On Windows, using the Visual Studio "Performance Profiler" on a quadcore Intel NUC (Core i7), I saw times of 94ms without the scalable memory allocator and 50ms with it (adding #include "tbb/tbbmalloc_proxy.h" into the example program).

Compilation considerations

Personally, I haven't had a problem with compilers doing "malloc optimizations," but technically such optimizations should be disabled when compiling programs that will use a replacement allocator. It might be wise to check the documentation of your favorite compiler. For instance, with the Intel compilers or gcc, it's best to pass the following flags:

-fno-builtin-malloc (on Windows: /Qfno-builtin-malloc)

-fno-builtin-calloc (on Windows: /Qfno-builtin-calloc)

-fno-builtin-realloc (on Windows: /Qfno-builtin-realloc)

-fno-builtin-free (on Windows: /Qfno-builtin-free)

Failure to use these flags may not cause a problem, but it’s not a bad idea to be safe.
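Putting it together, a build line for the example program might look like this; the compiler choice and library names are my assumptions, so adjust for your toolchain:

```shell
# Disable the built-in allocation intrinsics and link both TBB (for
# parallel_for) and the proxy, so every allocation goes through the
# scalable allocator.
g++ -O2 \
    -fno-builtin-malloc -fno-builtin-calloc \
    -fno-builtin-realloc -fno-builtin-free \
    tbb_mem.cpp -o tbb_mem -ltbb -ltbbmalloc_proxy
```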

Summary

Using a scalable memory allocator is an essential element of any parallel program. I've shown that TBBmalloc can be injected easily without requiring code changes (although adding an "include" is my favorite solution on Windows). You might see a nice speed-up with only 5 minutes of work, and you can apply it to multiple applications easily. On Linux and macOS, you might even be able to speed up programs without having the source code!

