MXNet review: Amazon's scalable deep learning

Amazon’s favorite deep learning framework scales across multiple GPUs and hosts, but it's rough around the edges

Deep learning, which is basically neural network machine learning with multiple hidden layers, is all the rage—both for problems that justify the complexity and high computational cost of deep learning, such as image recognition and natural language parsing, and for problems that might be better served by careful data preparation and simple algorithms, such as forecasting the next quarter’s sales. If you actually need deep learning, there are many packages that could serve your needs: Google TensorFlow, Microsoft Cognitive Toolkit, Caffe, Theano, Torch, and MXNet, for starters.

I confess that I had never heard of MXNet (pronounced “mix-net”) before Amazon CTO Werner Vogels noted it in his blog. There he announced that in addition to supporting all of the deep learning packages I mentioned above, Amazon decided to contribute significantly to one in particular, MXNet, which it selected as its deep learning framework of choice. Vogels went on to explain why: MXNet combines the ability to scale to multiple GPUs (across multiple hosts) with good programmability and good portability.

MXNet originated at Carnegie Mellon University and the University of Washington. It is now developed by collaborators from multiple universities and many companies, including the likes of Amazon, Baidu, Intel, Microsoft, Nvidia, and Wolfram. MXNet allows you to mix symbolic programming (declaration of the computation graph) and imperative programming (straight tensor operations) to maximize both efficiency and productivity.

The MXNet platform is built on a dynamic dependency scheduler that automatically parallelizes both symbolic and imperative operations on the fly, although you have to tell it what GPU and CPU cores to use. A graph optimization layer on top of the scheduler makes symbolic execution fast and memory efficient.

MXNet currently supports building and training models in Python, R, Scala, Julia, and C++; trained MXNet models can also be used for prediction in Matlab and JavaScript. No matter what language you use for building your model, MXNet calls an optimized C++ back-end engine.

mxnet overview

An overview of the MXNet architecture: NDArrays are representations of tensors. The KVStore is a distributed key-value store for data synchronization over multiple devices.

MXNet features

In the original paper on MXNet, the authors explain why they combined the symbolic declaration of a computation graph, which allows more opportunity for optimization of neural network solutions, and imperative programming of tensors, which allows for more flexibility, more natural variation of parameters, and easier debugging. They went to some effort to bind their API to multiple programming languages and to implement autodifferentiation to derive gradients.

The authors consider the MXNet API a superset of what’s offered in Torch, Theano, Chainer, and Caffe, while offering more portability and support for GPU clusters. They consider MXNet similar to TensorFlow but with the ability to embed imperative tensor operations.

The MXNet architecture, shown in the figure above, has five user-facing modules and five system modules. The SimpleOp module, which is not shown in the diagram, has been split off from the Operator module, which is shown. That doesn’t really matter to a programmer, however. The Operator module includes operators that define static forward and gradient calculation. SimpleOp includes operators that extend to NDArray operators and symbolic operators in a unified fashion. 

Looking at the Python API, there are six high-level interfaces:

  • The Module API is a flexible high-level interface for training neural networks.
  • The Model API is an alternative, simple high-level interface for training neural networks.
  • The Symbolic API performs operations on NDArrays to assemble neural networks from layers.
  • The IO Data Loading API performs parsing and data loading.
  • The NDArray API performs vector/matrix/tensor operations.
  • The KVStore API performs multi-GPU and multihost distributed training.

The MXNet shared back-end library is written in C++ with the standard C++ library for efficiency and portability. The language embeddings are written in the target languages Python, R, and Scala, with some shims in C++ for the R API.

A fair chunk of the engine code provides support for Nvidia CUDA general-purpose GPUs. For the highest performance, build and run MXNet to use the CUDA SDK and CUDA Deep Neural Network (CUDNN) library. As the MXNet authors noted when comparing their implementation with competitors on the convnet-benchmarks, most deep neural network computations are spent on the CUDA/CUDNN kernels. At the time the original paper was written, TensorFlow used an older version of CUDNN than the other packages and, therefore, ran slower; that has since been corrected.

When researching scalability for that paper, the MXNet authors ran a GoogLeNet benchmark (a 22-layer incarnation of the Inception convolutional neural network for image object detection and classification) on one and 10 Amazon EC2 g2.8xlarge instances, and they showed a superlinear speedup. When Amazon tested a MXNet implementation of the related Inception v3 algorithm on P2.16xlarge instances for varying numbers of GPUs, the results showed a scaling efficiency of 85 percent of the number of GPUs in use, as shown in the figure below. 

Those are impressive results. The p2.16xlarge instances have 16 Nvidia Tesla K80 GPUs (eight boards), and Amazon ran the scaling test out to 128 GPUs. That’s cluster performance.

I haven’t tried to reproduce either experiment. Your mileage may vary and will depend heavily on your neural network design (the more layers, the longer it takes) and choice of optimizer (some converge better than others for specific algorithms and data sets).

mxnet scaling Amazon

Amazon tested an Inception v3 algorithm implemented in MXNet on P2.16xlarge instances and found a scaling efficiency of 85 percent.

MXNet installation

MXNet supports training on Linux, OS X, Windows, and Docker; it also supports prediction from JavaScript in a browser and on an iOS or Android device using an amalgamated C++ package.

I did my initial installation on a MacBook Pro with a Nvidia GeForce GT 650M GPU and both CUDA and CUDNN installed. Based on previous experience with TensorFlow, I initially built MXNet for the CPU, not the GPU.

On a Mac, you first use Homebrew to install some prerequisites, then clone MXNet from GitHub, copy the Mac makefile, and build the shared library. After some initial fumbles, it took me about eight minutes to compile and link the library. Once the library has been built, you need to install the language packages you want to use. If you want to use Jupyter notebooks with Python to run the MXNet samples, you need to install Jupyter as well, although the MXNet instructions don’t mention that.

Using the Amazon Deep Learning AMI, which has MXNet and four other deep learning packages pre-installed along with all of the required languages and the Anaconda Python library package (including Jupyter), can save you the effort of building the packages from source or even downloading them. Using the older Amazon G2 (K520 GPUs), newer P2 instances (K80 GPUs), or Microsoft Azure NC instances (K80 GPUs) will enable you to run MXNet on serious floating point processors. IBM SoftLayer also offers servers with K80 GPUs on monthly and hourly terms. The Google Cloud will offer VM instances with K80 GPUs and even faster P100 GPUs in 2017.

Running MXNet

The “getting started” exercise for MXNet is (surprise!) a simple network for training a classifier on the MNIST hand-drawn numerals data set, one of the easiest standard machine learning problems. This exercise covers how to train a multilayer perceptron model using Python, R, Scala, and Julia. The multilayer perceptron and a more accurate convolutional solution (LeNet) are covered in a Python tutorial. You’ll find the source code for this tutorial in Jupyter notebook form within the mxnet-notebooks/python/tutorials folder of the MXNet repository that you downloaded as part of the MXNet installation.

I ran the notebook locally on my MacBook Pro, only to have it fail at the import mxnet statement. After smacking myself on the forehead for making a newbie omission (aided and abetted by confusing documentation), I installed the Python language module with the script in the python directory of the repository. I used the current-user option,

python develop —user

When I restarted the MNIST notebook, it failed at a call to mx.viz.plot_network. Alas, the documentation hadn’t mentioned the need to install Graphviz. I tried to install Graphviz with pip, which seemed like the obvious thing to do, but the notebook failed again with a message saying that it couldn’t find the Graphviz executables. I had no idea what was wrong, so I Googled the error message, found many reports of similar problems, and guessed about which of the multiple answers (none of which had feedback) might be correct for my system. After tossing yarrow stalks I reinstalled Graphviz with Homebrew, then the MNIST notebook worked and displayed the network graph.

mxnet multilayer perceptron

MXNet comes with a Python tutorial on classifying the MNIST data set with a multilayer perceptron model. Here we have used the Graphviz package to plot the network defined with calls to the MXNet symbolic module.

A few cells later, training the multilayer perceptron model succeeded and took approximately one second per epoch.

mxnet mnist feedforward

Multilayer perceptron network training with MXNet. As you can see, each epoch took about one second, and after nine epochs the validation accuracy was 96.97 percent.

The tutorial continues with another model, the LeNet Convolutional Neural Network (CNN), which failed with the message “Compile with USE_CUDA=1 to enable GPU usage.” I pretty much knew from working with TensorFlow that compiling for CUDA wouldn’t work on my Mac, but I tried it anyway and got a message that the Nvidia CUDA compiler driver NVCC is not compatible with Xcode 8. (Nvidia has been promising to correct this for at least a month.)

To cover all the bases, I switched the active compiler to Xcode 7.3.1, which is compatible with NVCC but caused compile errors. I switched back to Xcode 8, restarted the notebook, commented out the line setting the GPU context, and ran the CNN on the default CPU:

model = mx.model.FeedForward(
    # ctx = mx.gpu(0),    
    symbol = lenet,      
    num_epoch = 10,    
    learning_rate = 0.1)
    batch_end_callback = mx.callback.Speedometer(batch_size, 200)

That worked, but was quite slow. It took about three minutes per epoch and about half an hour for the entire run. It didn’t seem to be using more than one core, based on the Mac Activity Monitor. Apparently the default MXNet context runs on CPU core 0.

mxnet lenet 1 cpu core

LeNet convolutional model training on MNIST data. The time per epoch on one core was about three minutes, and the validation accuracy after 10 epochs was 98.75 percent.

I naively tried to run the LeNet training on all eight cores, passing a list for the context:

ctx = [mx.cpu(0), mx.cpu(1), mx.cpu(2), mx.cpu(3), mx.cpu(4), mx.cpu(5), mx.cpu(6), mx.cpu(7)]

That sped up the training, but only by a factor of three, and the convergence wasn’t as good running in parallel:

mxnet lenet 8 cpu cores

Using eight CPU cores, we only get a 3X speedup over using one core, and the final validation accuracy is not quite as good.

I tried this again with only four cores:

ctx = [mx.cpu(n) for n in range(4)], #use 4 cores

That gave me roughly the same training speed as eight cores, with a final validation accuracy closer to the one-core result. I don’t know why exactly, but I think the limit on core usage has to do with the number of layers in the model. The variation in results does bring home the fact that ML training is a random process. The subject of running MXNet in parallel is documented, but I haven’t found an answer to the question of optimum parallelism.

You’ve probably gotten the idea by now that that the MXNet documentation leaves something to be desired. I hope that will be one of the items that Amazon helps to improve, now that it has adopted the project. 

MXNet tutorials and models

In addition to MNIST digit classification, the computer vision MXNet tutorials in Python cover image classification and segmentation using convolutional neural networks (CNN); object detection using Faster R-CNN; neural art; and classification of ImageNet using a deep CNN. In R, only two of these have been implemented; in Scala, only the MNIST tutorial is shown.

For natural language processing (NLP), the Python tutorials are for character-level long short-term memory (LSTM); text classification using a CNN; and noise-contrastive estimation (NCE) loss with an LSTM model. In R, there’s a different character language model. There is no Scala example for NLP.

There are two Python examples for speech recognition; a Python generative adversarial network trained on MNIST; three Python unsupervised machine learning tutorials; and one R supervised machine learning tutorial. There are additional models in the example folder of the MXNet repository, mostly in Python.

Overall, I have to agree with the authors of MXNet that their deep learning framework is similar to TensorFlow in many respects. As for differences, while MXNet lacks the visual debugging available for TensorFlow in TensorBoard, it offers an imperative language for tensor calculations that TensorFlow lacks.

The MXNet “model zoo” is not yet as complete as TensorFlow’s, and the MXNet documentation needs some work. The multi-GPU scaling performance reported by Amazon is exciting, however. Given the complexity and computational cost of training deep learning models, this might be enough to attract deep learning practitioners to MXNet even in its present state.

InfoWorld Scorecard
Models and algorithms (25%)
Ease of development (25%)
Documentation (20%)
Performance (20%)
Ease of deployment (10%)
Overall Score (100%)
MXNet v0.7 8 8 7 10 8 8.2
At a Glance
  • This portable, scalable deep learning library combines symbolic declaration of neural network geometries with imperative programming of tensor operations.


    • Scales to multiple GPUs across multiple hosts with scaling efficiency of 85 percent
    • Excellent development speed and programmability
    • Excellent portability
    • Supports Python, R, Scala, Julia, and C++
    • Allows you to mix symbolic and imperative programming flavors


    • Documentation still feels unfinished
    • Few examples for languages other than Python

Copyright © 2016 IDG Communications, Inc.