Thanks to Intel, I just got a 20X speed-up in Python that I can turn on and off with a single command. And this wasn’t even in ideal conditions, but in a virtual environment: openSUSE Linux (Tumbleweed) running on a VBox on my quad-core iMac. What I did can be done on Windows, Linux, or OS X. Intel doesn’t include openSUSE in its list of tested Linux configurations (SUSE Enterprise is on the list), but it worked perfectly for me.

Here’s how I did it:

1. Download the Anaconda command-line installer from https://www.continuum.io/downloads.

2. Install it (per their web page):

% bash Anaconda2-4.3.0-Linux-x86_64.sh

3. Install Intel’s acceleration, as a separate “environment” that I can turn on and off:

% conda config --add channels intel

% conda create --name intelpy intelpython2_full python=2

4. Run my sample program and see a speed-up of 15X to 20X on my openSUSE VBox setup.

% source deactivate intelpy

% python < myprog.py

np.sin

102400 10000 36.1987440586

np.cos

102400 10000 36.1938228607

np.tan

102400 10000 51.487637043

% source activate intelpy

% python < myprog.py

np.sin

102400 10000 1.76131296158

np.cos

102400 10000 1.83870100975

np.tan

102400 10000 3.38778400421

That’s all! The speed-ups are 20.6X, 19.7X, and 15.2X in this quick test running on a virtual machine.
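For anyone who wants to check my arithmetic, here is a minimal sketch that recomputes those ratios from the timings printed above. The `speedup` helper and the `timings` list are names made up for this illustration; the numbers are copied from the two runs.

```python
# Timings in seconds copied from the two runs above:
# (label, stock Anaconda run, Intel-accelerated run)
timings = [
    ("np.sin", 36.1987440586, 1.76131296158),
    ("np.cos", 36.1938228607, 1.83870100975),
    ("np.tan", 51.487637043, 3.38778400421),
]


def speedup(baseline_seconds, accelerated_seconds):
    """Return how many times faster the accelerated run was."""
    return baseline_seconds / accelerated_seconds


for name, before, after in timings:
    print("%s: %.1fX" % (name, speedup(before, after)))
```

This prints 20.6X, 19.7X, and 15.2X, matching the figures quoted above.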

Here’s my little Python program:

% cat myprog.py

import numpy as np
import time

N = 102400
x = np.linspace(0.0123, 4567.89, N)

# Apply func to the whole array Z times and report the elapsed wall-clock time.
def mine(x, Z, func, name):
    print name
    start = time.time()
    for z in range(0, Z):
        y = func(x)
    end = time.time()
    # Columns: array length, repetitions, seconds
    print N, Z, end - start

mine(x, 10000, np.sin, 'np.sin')
mine(x, 10000, np.cos, 'np.cos')
mine(x, 10000, np.tan, 'np.tan')

The program is something I threw together quickly to check out Intel’s claims to have accelerated transcendental expressions in NumPy. Cosine, sine, and tangent were the transcendentals I could remember from my TI calculator days, so I tried them. I decided to do a little more than a billion of each by running a function on more than 100,000 numbers and repeating that 10,000 times. (A good test for speed-up even if not a particularly interesting program.)

**Accelerated Python at will**

I previously wrote about how “accelerated Python” has made Python worth another look for big data and high-performance computing applications. In addition to the news that accelerated Python is even faster, I’ve shown above how easy it is to use Conda to turn the acceleration ON and OFF. This is very cool, and helps make the decision to install even safer – because it remains optional. (Note: Anaconda is a collection of many packages for Python, and Conda is a package manager. I use them both and like them a lot.)

I used “conda create” to create an environment that I called intelpy. Then, I could activate and deactivate it with “source activate intelpy” and “source deactivate intelpy”.
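When switching back and forth, it is easy to lose track of which environment a given run used. The sketch below is one way to check from inside Python. It assumes that conda's activate script sets the `CONDA_DEFAULT_ENV` environment variable, which may not hold for every conda version or shell; the `describe_environment` helper is a hypothetical name for this illustration.

```python
import os
import sys


def describe_environment():
    """Return (conda_env_name, interpreter_prefix) for the running Python."""
    # CONDA_DEFAULT_ENV is normally set when a conda environment is activated;
    # fall back to a placeholder when it is absent.
    env_name = os.environ.get("CONDA_DEFAULT_ENV", "(none)")
    return env_name, sys.prefix


name, prefix = describe_environment()
print("conda environment: %s" % name)
print("interpreter prefix: %s" % prefix)
```

With intelpy active, I would expect the prefix to point inside the intelpy environment directory rather than the base Anaconda install.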

The substantial new acceleration capabilities that have been released by Intel make an even more convincing case for accelerated Python.

It’s important to note that accelerated Python is simply using a faster set of Python libraries, and requires no changes to our code. Of course, our Python code has to be using something that is accelerated in order to benefit from this.
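Since the acceleration lives in the libraries rather than in our code, one quick sanity check is to ask NumPy where it was loaded from and how it was built. This is only a sketch: the exact output of `np.show_config()` varies between NumPy versions and builds, so treat the presence of "mkl" in the output as a hint rather than proof.

```python
import numpy as np

print("NumPy version: %s" % np.__version__)
print("NumPy loaded from: %s" % np.__file__)

# show_config() prints build-time information about the linear algebra
# libraries NumPy was built against. In an MKL-backed build, the library
# names listed here usually mention "mkl".
np.show_config()
```

Running this once in each environment makes it clear that the same script is picking up two different NumPy installations.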

Intel gets this acceleration by focusing on three things:

- Taking advantage of multicore
- Taking advantage of vector (also called SIMD) instructions such as SSE, AVX, AVX2, and AVX-512
- Using advanced algorithms in the Intel® Math Kernel Library (Intel® MKL)

All three of these happen in programs that operate on vectors or matrices. We shouldn’t expect big speed-ups for an occasional standalone cosine. Nor should we expect as much speed-up on a single core as on multicore. Of course, Intel’s 72-core processor, the Intel® Xeon Phi™ processor, will lead many benchmarks when lots of cores can help. In my case, my virtual machine was set up to use only four cores on my i5-based iMac.
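To make the "occasional standalone cosine" point concrete, here is a small sketch that computes the same cosines two ways: one NumPy call over a whole array, and one `math.cos` call per element in a Python loop. The function names are made up for this illustration, and the array is shorter than in my test to keep the loop version quick. Only the first form hands the library a whole vector to work on.

```python
import math
import timeit

import numpy as np

N = 100000  # array length, chosen only to keep this demo fast
x = np.linspace(0.0123, 4567.89, N)


def cos_vectorized(values):
    """One NumPy call over the whole array."""
    return np.cos(values)


def cos_one_at_a_time(values):
    """A separate math.cos call per element, driven by a Python loop."""
    return [math.cos(v) for v in values]


passes = 10
t_vec = timeit.timeit(lambda: cos_vectorized(x), number=passes) / passes
t_loop = timeit.timeit(lambda: cos_one_at_a_time(x), number=passes) / passes
print("vectorized:  %.6f s per pass" % t_vec)
print("scalar loop: %.6f s per pass" % t_loop)
```

Both functions return the same values; the difference is in how much work each call gives the underlying library at once.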

**My quick FFT program gets 8X on my four-core virtual machine**

I also gave Fast Fourier Transforms (FFTs) a try. Using the same setup as with my original program, I simply ran my FFT program as follows:

% source deactivate intelpy

% python < myfftprog.py

fft

5000 2.22796392441

fft

7000 8.74916005135

% source activate intelpy

% python < myfftprog.py

fft

5000 0.277629137039

fft

7000 1.11230897903

The speed-ups are 8.0X and 7.9X, again running openSUSE using four cores on a VBox on my iMac. Here’s my quick FFT program:

% cat myfftprog.py

import numpy as np
import numpy.random as rn
import time

# Time a single 2D FFT of a Z-by-Z complex matrix.
def trythis(Z):
    mat = rn.rand(Z, Z) + 1j * rn.randn(Z, Z)
    print 'fft'
    start = time.time()
    # 2D transform on a complex-valued matrix:
    result = np.fft.fft2(mat)
    end = time.time()
    # Columns: matrix dimension, seconds
    print Z, end - start

trythis(5000)
trythis(7000)

**Newly accelerated**

Back to the new accelerations. Here’s a run-down on what is newly accelerated in the latest “update 2” from Intel:

**Optimized arithmetic and transcendental expressions in NumPy**

The transcendentals include the cosine, sine, and tangent that I took for a spin in my quick example program. The key to these optimizations is a set of changes in NumPy that allow primitives (which operate on ndarray data) to selectively use the capabilities of the Intel MKL Short Vector Math Library (SVML) and the Intel MKL Vector Math Library (VML). This lets Python use the latest vector capabilities of processors, including multicore optimizations and AVX/AVX2/AVX-512. The Intel team says they’ve seen NumPy arithmetic and transcendental operations on vector-vector and vector-scalar operands accelerated by up to 400X on Intel Xeon Phi processors.
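As a rough illustration of what "vector-vector", "vector-scalar", and "transcendental" mean in NumPy terms, here is a sketch that times one example of each. The particular operations, array sizes, and the `time_operations` helper are my own choices for this illustration, not Intel's benchmark, and the timings will depend entirely on the machine and the NumPy build.

```python
import timeit

import numpy as np

a = np.linspace(0.1, 10.0, 100000)
b = np.linspace(10.0, 0.1, 100000)

# One representative of each kind of operation mentioned above.
operations = {
    "vector-vector add": lambda: a + b,
    "vector-scalar multiply": lambda: a * 2.5,
    "transcendental exp": lambda: np.exp(a),
    "transcendental log": lambda: np.log(a),
}


def time_operations(ops, repeats=100):
    """Return the average seconds per call for each labeled operation."""
    results = {}
    for label, op in ops.items():
        results[label] = timeit.timeit(op, number=repeats) / repeats
    return results


for label, seconds in sorted(time_operations(operations).items()):
    print("%-24s %.3e s per call" % (label, seconds))
```

Running it with intelpy deactivated and then activated gives a per-operation comparison along the same lines as my sine, cosine, and tangent test.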

**Optimized Fast Fourier Transforms in NumPy and SciPy FFT**

The key to these optimizations is the Intel MKL, with its native optimizations for FFT as needed by a range of NumPy and SciPy functions. The optimizations include real and complex data types, both single and double precision, for one-dimensional and multidimensional data, in place and out of place. The Intel team says they’ve seen performance improve up to 60x from this update, which now lets Python performance rival that of a native C/C++ program using Intel MKL directly.
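For readers less familiar with the FFT variants mentioned here, this sketch exercises a few of them through the standard `numpy.fft` interface: a real 1D transform, a complex 1D transform, and a complex 2D transform with a round trip back. It covers only the real/complex and 1D/2D cases, not single precision or in-place transforms, and the sizes are arbitrary small values chosen for the demo.

```python
import numpy as np

rng = np.random.RandomState(0)  # fixed seed so runs are repeatable

# One-dimensional, real-valued input: rfft returns only the non-negative
# frequencies, so a length-n signal yields n // 2 + 1 coefficients.
signal = rng.rand(1024)
spectrum_real = np.fft.rfft(signal)

# One-dimensional, complex-valued input.
signal_c = rng.rand(1024) + 1j * rng.randn(1024)
spectrum_complex = np.fft.fft(signal_c)

# Two-dimensional complex input, as in myfftprog.py but much smaller.
mat = rng.rand(64, 64) + 1j * rng.randn(64, 64)
spectrum_2d = np.fft.fft2(mat)

# Round trip: the inverse transform should recover the input.
recovered = np.fft.ifft2(spectrum_2d)

print("rfft output length: %d" % spectrum_real.shape[0])
print("fft output length:  %d" % spectrum_complex.shape[0])
print("round trip ok: %s" % np.allclose(recovered, mat))
```

Because these are ordinary NumPy calls, the same script runs unchanged in either environment, which is the point of the "no code changes" claim.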

**Optimized memory management**

Python is a dynamic language that manages memory for the user. The performance of Python applications depends a great deal on the performance of memory operations, including allocation, de-allocation, copy, and move. The accelerated Python from Intel now ensures the best alignment when NumPy allocates arrays, so that NumPy and SciPy compute functions can use the aligned versions of SIMD memory access instructions. Intel says the biggest gains come from optimizations to memory copy and move operations.
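If you are curious how a particular array ended up aligned, NumPy exposes enough to check. The sketch below reports the offset of an array's data pointer from a 64-byte boundary (64 bytes is the width of one AVX-512 vector). The `alignment_offset` helper is a hypothetical name, the 64-byte boundary is my assumption about what "best alignment" would mean here, and the result will vary by build and allocation.

```python
import numpy as np


def alignment_offset(array, boundary=64):
    """Return the data pointer's offset in bytes from the given boundary."""
    # array.ctypes.data is the address of the first element as an integer.
    return array.ctypes.data % boundary


x = np.zeros(102400)
print("offset from 64-byte boundary: %d" % alignment_offset(x))
print("aligned for its dtype: %s" % x.flags.aligned)
```

An offset of 0 means the array starts exactly on a 64-byte boundary; any other value means it does not.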

**Faster – and easy to turn on and off with Conda**

The latest accelerated Python from the Anaconda Intel Channel (or Intel direct) delivers significant performance optimizations for Python programs without requiring code changes. And it’s all very easy to download and install.

And I really love how Conda lets me turn it on and off. That’s great for comparisons, and for peace of mind in case I’m hesitant to completely switch to these super-fast math functions from Intel.

**Learning more**

To dive in deeper, here are some links:

- More detailed instructions on how to install Intel's accelerated Python with Anaconda
- Official Intel blog about Update 2 of their "accelerated Python"
