6 Python libraries for parallel processing

Want to distribute that heavy Python workload across multiple CPUs or a compute cluster? These frameworks can make it happen

6 Python libraries for parallel processing
graemenicholson / Getty Images
Table of Contents
Show More

Python is long on convenience and programmer-friendliness, but it isn’t the fastest programming language around. Some of its speed limitations are due to its default implementation, cPython, being single-threaded. That is, cPython doesn’t use more than one hardware thread at a time.

And while you can use the threading module built into Python to speed things up, threading only gives you concurrency, not parallelism. It’s good for running multiple tasks that aren’t CPU-dependent, but does nothing to speed up multiple tasks that each require a full CPU. 

Python does include a native way to run a Python workload across multiple CPUs. The multiprocessing module spins up multiple copies of the Python interpreter, each on a separate core, and provides primitives for splitting tasks across cores. But sometimes even multiprocessing isn’t enough.

Sometimes the job calls for distributing work not only across multiple cores, but also across multiple machines. That’s where these six Python libraries and frameworks come in. All six of the Python toolkits below allow you to take an existing Python application and spread the work across multiple cores, multiple machines, or both.


Developed by a team of researchers at the University of California, Berkeley, Ray underpins a number of distributed machine learning libraries. But Ray isn’t limited to machine learning tasks alone, even if that was its original use case. Any Python tasks can be broken up and distributed across systems with Ray.

Ray’s syntax is minimal, so you don’t need to rework existing apps extensively to parallelize them. The @ray.remote decorator distributes that function across any available nodes in a Ray cluster, with optionally specified parameters for how many CPUs or GPUs to use. The results of each distributed function are returned as Python objects, so they’re easy to manage and store, and the amount of copying across or within nodes is kept to a minimum. This last feature comes in handy when dealing with NumPy arrays, for instance.

Ray even includes its own built-in cluster manager, which can automatically spin up nodes as needed on local hardware or popular cloud computing platforms.


From the outside, Dask looks a lot like Ray. It, too, is a library for distributed parallel computing in Python, with its own task scheduling system, awareness of Python data frameworks like NumPy, and the ability to scale from one machine to many.

Dask works in two basic ways. The first is by way of parallelized data structures — essentially, Dask’s own versions of NumPy arrays, lists, or Pandas DataFrames. Swap in the Dask versions of those constructions for their defaults, and Dask will automatically spread their execution across your cluster. This typically involves little more than changing the name of an import, but may sometimes require rewriting to work completely.

The second way is through Dask’s low-level parallelization mechanisms, including function decorators, that parcel out jobs across nodes and return results synchronously (“immediate” mode) or asynchronously (“lazy”). Both modes can be mixed as needed, too.

One key difference between Dask and Ray is the scheduling mechanism. Dask uses a centralized scheduler that handles all tasks for a cluster. Ray is decentralized, meaning each machine runs its own scheduler, so any issues with a scheduled task are handled at the level of the individual machine, not the whole cluster.

Dask also offers an advanced and still experimental feature called “actors.” An actor is an object that points to a job on another Dask node. This way, a job that requires a lot of local state can run in-place and be called remotely by other nodes, so the state for the job doesn’t have to be replicated. Ray lacks anything like Dask’s actor model to support more sophisticated job distribution.


Dispy lets you distribute whole Python programs or just individual functions across a cluster of machines for parallel execution. It uses platform-native mechanisms for network communication to keep things fast and efficient, so Linux, MacOS, and Windows machines work equally well.

Dispy syntax somewhat resembles multiprocessing in that you explicitly create a cluster (where multiprocessing would have you create a process pool), submit work to the cluster, then retrieve the results. A little more work may be required to modify jobs to work with Dispy, but you also gain precise control over how those jobs are dispatched and returned. For instance, you can return provisional or partially completed results, transfer files as part of the job distribution process, and use SSL encryption when transferring data.


Pandaral·lel, as the name implies, is a way to parallelize Pandas jobs across multiple nodes. The downside is that Pandaral·lel works only with Pandas. But if Pandas is what you’re using, and all you need is a way to accelerate Pandas jobs across multiple cores on a single computer, Pandaral·lel is laser-focused on the task.

Note that while Pandaral·lel does run on Windows, it will run only from Python sessions launched in the Windows Subsystem for Linux. MacOS and Linux users can run Pandaral·lel as-is. 


Ipyparallel is another tightly focused multiprocessing and task-distribution system, specifically for parallelizing the execution of Jupyter notebook code across a cluster. Projects and teams already working in Jupyter can start using Ipyparallel immediately.

Ipyparallel supports many approaches to parallelizing code. On the simple end, there’s map, which applies any function to a sequence and splits the work evenly across available nodes. For more complex work, you can decorate specific functions to always run remotely or in parallel.

Jupyter notebooks support “magic commands” for actions only possible in a notebook environment. Ipyparallel adds a few magic commands of its own. For example, you can prefix any Python statement with %px to automatically parallelize it.


Joblib has two major goals: run jobs in parallel and don’t recompute results if nothing has changed. These efficiencies make Joblib well-suited for scientific computing, where reproducible results are sacrosanct. Joblib’s documentation provides plenty of examples for how to use all its features.

Joblib syntax for parallelizing work is simple enough—it amounts to a decorator that can be used to split jobs across processors, or to cache results. Parallel jobs can use threads or processes.

Joblib includes a transparent disk cache for Python objects created by compute jobs. This cache not only helps Joblib avoid repeating work, as noted above, but can also be used to suspend and resume long-running jobs, or pick up where a job left off after a crash. The cache is also intelligently optimized for large objects like NumPy arrays. Regions of data can be shared in-memory between processes on the same system by using numpy.memmap.

One thing Joblib does not offer is a way to distribute jobs across multiple separate computers. In theory it’s possible to use Joblib’s pipeline to do this, but it’s probably easier to use another framework that supports it natively. 

Read more about Python

Copyright © 2020 IDG Communications, Inc.