Review: Amazon SageMaker scales deep learning

AWS machine learning service offers easy scalability for training and inference, includes a good set of algorithms, and supports any others you supply


Amazon SageMaker, a machine learning development and deployment service introduced at re:Invent 2017, cleverly sidesteps the eternal debate about the “best” machine learning and deep learning frameworks by supporting all of them at some level. While AWS has publicly supported Apache MXNet, its business is selling you cloud services, not telling you how to do your job.

SageMaker, as shown in the screenshot below, lets you create Jupyter notebook VM instances in which you can write code and run it interactively, initially for cleaning and transforming (feature engineering) your data. Once the data is prepared, notebook code can spawn training jobs in other instances, and create trained models that can be used for prediction. SageMaker also sidesteps the need to have massive GPU resources constantly attached to your development notebook environment by letting you specify the number and type of VM instances needed for each training and inference job.

Trained models can be attached to endpoints that can be called as services. SageMaker relies on an S3 bucket (that you need to provide) for permanent storage, while notebook instances have their own temporary storage.

SageMaker provides 11 customized algorithms that you can train against your data. The documentation for each algorithm explains the recommended input format, whether it supports GPUs, and whether it supports distributed training. These algorithms cover many supervised and unsupervised learning use cases and reflect recent research, but you aren’t limited to the algorithms that Amazon provides. You can also use custom TensorFlow or Apache MXNet Python code, both of which are pre-loaded into the notebook, or supply a Docker image that contains your own code written in essentially any language using any framework. A hyperparameter optimization layer is available as a preview for a limited number of beta testers.

In addition to running SageMaker from the AWS console, you can run it via its service API from your own programs. Within a Jupyter notebook, you can call the high-level Python library provided by Amazon SageMaker or the more basic AWS SDK for Python (Boto), in addition to common Python libraries such as NumPy.

The main competitors to Amazon SageMaker include Microsoft Machine Learning Studio, Azure Notebooks, and Azure Data Science Virtual Machines; Google Cloud Machine Learning Engine; and H2O Driverless AI. Another, easier set of alternatives to consider, especially if you’re not really a data scientist, would be the pre-built applied machine learning services, such as Amazon’s own Comprehend, Lex, Polly, Rekognition, Translate, and Transcribe services, as well as the equivalents from Google and Microsoft.

[Image: Amazon SageMaker dashboard. Credit: IDG]

The Amazon SageMaker dashboard shows the four steps of the machine learning process: notebook instances, training jobs, models, and endpoints. It also shows your recent activity.

Amazon SageMaker architecture

As shown in the diagram below, SageMaker provides secure, scalable, pre-configured environments for developing, training, evaluating, and hosting machine learning models. In addition, it provides APIs by which it can be managed, and its own high-level Python framework for training and prediction. You supply training data to SageMaker in an S3 bucket; if the data resides elsewhere, such as Amazon Redshift or Aurora, you would run an ETL process to create the data set in S3.

This architecture does a good job of separating concerns, and it lets you create data sets joined from multiple sources, which is a common use case. It doesn’t handle the admittedly less common situation where you want to analyze data streams; for that you might want to consider Spark MLlib.

[Image: Amazon SageMaker block diagram. Credit: IDG]

The above block diagram explains how Amazon SageMaker fits together at a high level. Note the separation of concerns between training and inference code as well as between training data and model artifacts.

Amazon SageMaker notebooks

The SageMaker development environment is preloaded not only with SageMaker and Jupyter but also with Anaconda, CUDA, and cuDNN drivers, and optimized containers for TensorFlow and MXNet. You can also supply containers containing your own algorithms written using whatever languages and frameworks you desire.

When you create a SageMaker notebook instance you have a choice of sizes ranging from ml.t2.medium (2 vCPU, 4 GiB RAM) to ml.p3.16xlarge (64 vCPU, 8 V100 GPU, 488 GiB RAM, 128 GiB GPU RAM), as shown in the screenshot below. According to Nvidia, each V100 GPU contains 640 tensor cores and delivers over 100 teraflops of deep learning performance, which by Nvidia’s reckoning makes it roughly 47 times faster than a CPU server for deep learning inference.

You can keep your notebook instance size small (and inexpensive) for your data wrangling and development without affecting your training and inference speeds, since training jobs and inference endpoints run in their own on-demand instances. When you run a training job you can specify the number and type of instances that it should use. Exactly what number and type of instances will be optimal depends on the algorithms you choose, your time and cost constraints, and the size of your data. Not all algorithms need or support GPUs, and not all algorithms support running on multiple processors or multiple VMs. Helpfully, the documentation explains the supported configurations for each algorithm, although you may have to benchmark a bit yourself to find an optimum.
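To make the per-job choice of instances concrete, here is a minimal sketch that builds a request shaped like the arguments of the `create_training_job` call in the AWS SDK for Python. The helper name, job name, bucket, role ARN, and image URI are placeholders invented for this example, and the code only builds the request; nothing is sent to AWS.

```python
def build_training_request(job_name, image_uri, role_arn, bucket,
                           instance_type="ml.c4.xlarge", instance_count=1):
    """Build a dict shaped like the create_training_job arguments.

    Every value passed in here is a placeholder for illustration only.
    """
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,      # built-in or custom algorithm image
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "InputDataConfig": [{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": f"s3://{bucket}/train/",
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        "OutputDataConfig": {"S3OutputPath": f"s3://{bucket}/output/"},
        # The instances for this job, independent of the notebook instance size.
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
            "VolumeSizeInGB": 10,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }


request = build_training_request(
    job_name="fm-demo-job",
    image_uri="<training-image-uri>",
    role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    bucket="example-bucket",
    instance_type="ml.p3.2xlarge",
    instance_count=2,
)
# With credentials configured, the request could be submitted with:
#   boto3.client("sagemaker").create_training_job(**request)
```

The notebook instance itself can stay on a small size such as ml.t2.medium; only the `ResourceConfig` entry changes when a job needs GPUs or more machines.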

[Image: Amazon SageMaker notebook instance types. Credit: IDG]

You can create a Jupyter notebook instance for Amazon SageMaker in any of 15 virtual machine sizes.

Amazon SageMaker algorithms

As you undoubtedly know, training and evaluation turn algorithms into models by optimizing their parameters to find the set of values that best matches the ground truth of your data. SageMaker has 11 of its own algorithms. These include four unsupervised algorithms: k-means clustering, which attempts to find discrete groupings within data; principal component analysis (PCA), which attempts to reduce the dimensionality (number of features) within a data set while still retaining as much information as possible; latent Dirichlet allocation (LDA), which attempts to describe a set of observations as a mixture of distinct categories; and neural topic model (NTM), which tries to categorize documents by their probable topics.

Linear learner is a supervised learning algorithm used for solving either classification or regression problems. A factorization machine (used in the screenshots below) is an extension of a linear model that is designed to parsimoniously capture interactions between features within high-dimensional sparse data sets; SageMaker’s version considers only the second-order (pairwise) interactions. XGBoost (extreme gradient boosting) implements the gradient boosted trees algorithm, which attempts to accurately predict a target variable by combining the estimates of a set of simpler, weaker models.
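For readers who want to see what “second-order interactions” means in practice, the following is a minimal sketch of the usual factorization machine scoring formula: a bias, a linear term, and a pairwise term whose weight is the dot product of two per-feature factor vectors. It illustrates the general technique, not SageMaker’s implementation, and the numbers are made up.

```python
def fm_predict(x, w0, w, v):
    """Score one example with a second-order factorization machine.

    x  : feature values
    w0 : global bias
    w  : one linear weight per feature
    v  : one k-dimensional factor vector per feature
    """
    linear = sum(wi * xi for wi, xi in zip(w, x))
    pairwise = 0.0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            # The interaction weight for features i and j is <v[i], v[j]>.
            dot = sum(a * b for a, b in zip(v[i], v[j]))
            pairwise += dot * x[i] * x[j]
    return w0 + linear + pairwise


x = [1.0, 0.0, 2.0]                       # sparse input: feature 1 is absent
w = [0.5, -1.0, 0.25]
v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
score = fm_predict(x, 0.1, w, v)          # about 2.1
```

Because every pairwise weight comes from the small factor vectors rather than from its own parameter, the model can estimate interactions between features that rarely appear together in sparse data.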

[Image: Amazon SageMaker training output. Credit: IDG]

An Amazon SageMaker factorization machines training job is called from a Jupyter notebook. The job runs in its own instance, and the notebook waits until it has completed.

The Amazon SageMaker image classification algorithm is a supervised learning algorithm that takes an image as input and classifies it into one of multiple output categories. It uses a convolutional neural network (ResNet) that can either be trained from scratch or, when a large number of training images is not available, trained using transfer learning. Sequence2Sequence (seq2seq) is an implementation of Sockeye, which uses recurrent neural network (RNN) and convolutional neural network (CNN) models with attention for machine translation, text summarization, and speech-to-text.

DeepAR is a supervised learning algorithm for forecasting scalar time series using recurrent neural networks. Unlike classical forecasting methods (e.g. ARIMA and exponential smoothing), which fit a model to each individual series, DeepAR trains a single model over a large set of related time series. Finally, the BlazingText algorithm is an implementation of the Word2vec algorithm, which learns high-quality distributed vector representations of words (word embeddings) from a large collection of documents.

This is a pretty fair set of algorithms, reflecting current research in machine learning and deep learning, and in many cases optimized to run well on GPUs (when they’re useful) and even on multi-GPU and multi-machine clusters for distributed training. The ones I have examined appear to be implemented on top of MXNet. You may, however, want to use another algorithm that might work better for your data.

[Image: Amazon SageMaker factorization machines confusion matrix. Credit: IDG]

Once you have trained your model with Amazon SageMaker, you can test how well it works. In this portion of the Jupyter notebook the code predicts labels for a batch of images and creates a confusion matrix for identification of the digit 0.
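The notebook code itself is only visible in the screenshot, so as a rough stand-in, here is a minimal sketch of how a confusion matrix for a single digit can be tallied from actual and predicted labels. The function name and the toy labels are invented for this example.

```python
def binary_confusion(actual, predicted, positive=0):
    """Count outcomes when one class (here the digit 0) is treated as positive."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts


# Toy labels, not taken from the review's notebook.
actual    = [0, 0, 1, 7, 0, 3, 0, 8]
predicted = [0, 6, 1, 0, 0, 3, 0, 8]
cm = binary_confusion(actual, predicted)
# cm == {"tp": 3, "fp": 1, "fn": 1, "tn": 3}
```

Reading the four counts together shows both kinds of mistakes: a 0 that was missed (false negative) and another digit that was mistaken for a 0 (false positive).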

Using Amazon SageMaker with TensorFlow, MXNet, Spark, and your own custom algorithm

To support the use of any algorithm in a generic way, SageMaker allows you to supply an algorithm as a Docker container adhering to a specific format, which SageMaker can call with hyperparameters, input data channel information, training data, and a distributed training configuration. The same container can be used for training and inference, or you can supply two containers, for example to optimize the memory usage for production inference.
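As an illustration of what “calling the container with hyperparameters and a distributed training configuration” can look like from inside the container, the sketch below reads two JSON files from the `/opt/ml/input/config` directory layout described in the SageMaker container documentation. Treat the layout as an assumption to verify against the current documentation; the function name and the file contents in the demonstration are invented, and a temporary directory stands in for `/opt/ml` so the code runs anywhere.

```python
import json
import os
import tempfile


def load_training_config(base_dir="/opt/ml"):
    """Read hyperparameters and cluster layout from the container's config dir."""
    config_dir = os.path.join(base_dir, "input", "config")
    with open(os.path.join(config_dir, "hyperparameters.json")) as f:
        hyperparameters = json.load(f)   # values arrive as strings
    with open(os.path.join(config_dir, "resourceconfig.json")) as f:
        resources = json.load(f)         # names of the hosts in the training cluster
    return hyperparameters, resources


# Demonstration with invented file contents in a throwaway directory.
with tempfile.TemporaryDirectory() as base:
    cfg = os.path.join(base, "input", "config")
    os.makedirs(cfg)
    with open(os.path.join(cfg, "hyperparameters.json"), "w") as f:
        json.dump({"epochs": "5"}, f)
    with open(os.path.join(cfg, "resourceconfig.json"), "w") as f:
        json.dump({"current_host": "algo-1", "hosts": ["algo-1", "algo-2"]}, f)

    hp, res = load_training_config(base)
    epochs = int(hp["epochs"])            # convert from string before use
    is_distributed = len(res["hosts"]) > 1
```

A real training script would go on to read its data channels, train, and write model artifacts where SageMaker expects to find them.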

SageMaker also directly supports your TensorFlow and MXNet models. In both cases you supply custom code implementing specific training and inference function interfaces that SageMaker can call. You can use the high-level SageMaker Python SDK to simplify your Amazon-specific code.

Spark integration is a little different. Basically, you’ll want to continue to do your data preprocessing in Spark, then use the estimator in the Amazon SageMaker Spark library to train your model. You’ll pass your Spark DataFrame to SageMaker, and get back a SageMakerModel object, which you can then pass back to SageMaker to create an inference endpoint.

Amazon SageMaker models and endpoints for inference

No matter how you trained your model, you’ll want to deploy it for inference. Step one is to formally create the model, providing a name, the S3 path where the model artifacts are stored, and the Amazon Elastic Container Registry path for the Docker image that contains the inference code, which may be a standard Amazon algorithm image or a custom image that you’ve created.

Step two is to create an endpoint configuration for an HTTPS endpoint, meaning that you specify the name of one or more models in production variants and the machine learning compute instances that you want Amazon SageMaker to launch to host them. You can configure the endpoint to elastically scale the deployed machine learning compute instances. To actually create the endpoint, you pass the endpoint configuration to SageMaker, which creates the instances and exposes the HTTPS URL to accept inference requests from clients.
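The two steps, plus the final endpoint creation, map onto three calls in the AWS SDK for Python: `create_model`, `create_endpoint_config`, and `create_endpoint`. The sketch below only builds the three request dictionaries; the helper name, resource names, image URI, S3 path, and role ARN are placeholders invented for this example.

```python
def build_hosting_requests(name, image_uri, model_data_url, role_arn,
                           instance_type="ml.m4.xlarge", instance_count=1):
    """Return (model, endpoint_config, endpoint) request dicts for one model."""
    model = {
        "ModelName": name,
        "PrimaryContainer": {
            "Image": image_uri,              # ECR path of the inference image
            "ModelDataUrl": model_data_url,  # S3 path of the model artifacts
        },
        "ExecutionRoleArn": role_arn,
    }
    endpoint_config = {
        "EndpointConfigName": f"{name}-config",
        "ProductionVariants": [{
            "VariantName": "AllTraffic",
            "ModelName": name,
            "InitialInstanceCount": instance_count,
            "InstanceType": instance_type,
            "InitialVariantWeight": 1.0,
        }],
    }
    endpoint = {
        "EndpointName": f"{name}-endpoint",
        "EndpointConfigName": endpoint_config["EndpointConfigName"],
    }
    return model, endpoint_config, endpoint


model, endpoint_config, endpoint = build_hosting_requests(
    name="fm-demo",
    image_uri="<inference-image-uri>",
    model_data_url="s3://example-bucket/output/model.tar.gz",
    role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
)
# With credentials configured, these could be submitted in order:
#   client = boto3.client("sagemaker")
#   client.create_model(**model)
#   client.create_endpoint_config(**endpoint_config)
#   client.create_endpoint(**endpoint)
```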

Production variants allow you to test new or revised models on a fraction of client requests. When your testing is complete, you can adjust the ratios so that the new model takes all of the subsequent requests and the old model becomes inactive.
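To show how the ratios work, here is a small sketch that assumes each variant receives a share of requests equal to its weight divided by the sum of all variant weights, which is how the SageMaker documentation describes variant weights. The variant names and weights are invented.

```python
def traffic_shares(weights):
    """Convert variant weights into the fraction of requests each one receives."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one variant needs a positive weight")
    return {name: weight / total for name, weight in weights.items()}


# Trial phase: send roughly one request in ten to the new model.
canary = traffic_shares({"current": 9.0, "candidate": 1.0})

# After testing: the new model takes everything and the old one goes idle.
cutover = traffic_shares({"current": 0.0, "candidate": 1.0})
```

In this scheme, moving from the trial phase to full cutover is only a change of weights; the endpoint URL that clients call stays the same.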

Amazon SageMaker hyperparameter optimization

Amazon is currently testing a hyperparameter optimization (HPO) layer for SageMaker. Essentially, this generates multiple settings of the hyperparameters for a model and iteratively finds an optimal set of hyperparameters for the model and the data, within a specified range. The code below specifies the input data for an HPO job, and the graph that follows shows the results for one of the hyperparameters being tuned.

[Image: Amazon SageMaker hyperparameter tuning configuration. Credit: IDG]

This is the configuration structure passed to HPO (hyperparameter optimization) to define a run, in this case for three training jobs run in parallel at each step and a total of 20 jobs. Note that it is using a Bayesian strategy to maximize the area under the ROC curve measured on the validation data (valid-auc), with the maximum depth parameter varying between 1 and 10.
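The screenshot is not reproduced here, but a configuration matching that description might look like the sketch below. The field names follow the tuning job configuration accepted by `create_hyper_parameter_tuning_job` in the AWS SDK for Python as later documented, so the preview structure in the screenshot may have differed; `validation:auc` is the metric name used in the XGBoost documentation and is assumed to correspond to the caption’s “valid-auc”.

```python
import math

tuning_config = {
    "Strategy": "Bayesian",
    "HyperParameterTuningJobObjective": {
        "Type": "Maximize",
        "MetricName": "validation:auc",   # assumed equivalent of "valid-auc"
    },
    "ResourceLimits": {
        "MaxNumberOfTrainingJobs": 20,    # total jobs in the tuning run
        "MaxParallelTrainingJobs": 3,     # jobs running at the same time
    },
    "ParameterRanges": {
        # Range bounds are passed as strings in this API.
        "IntegerParameterRanges": [
            {"Name": "max_depth", "MinValue": "1", "MaxValue": "10"},
        ],
        "ContinuousParameterRanges": [],
        "CategoricalParameterRanges": [],
    },
}

limits = tuning_config["ResourceLimits"]
# With 3 jobs at a time and 20 in total, the run needs at least 7 rounds.
min_rounds = math.ceil(
    limits["MaxNumberOfTrainingJobs"] / limits["MaxParallelTrainingJobs"]
)
```

Running fewer jobs in parallel gives a Bayesian strategy more completed results to learn from before it picks the next candidates, at the cost of a longer wall-clock time.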

[Image: Objective metric vs. max_depth. Credit: IDG]

This chart summarizes the effect of the “max_depth” hyperparameter on the quality of the machine learning model, as discovered by the Amazon SageMaker hyperparameter tuning job set up by the code in the previous image. Although several hyperparameters are being tuned, this chart shows only one, with max_depth on the horizontal axis and the model quality on the vertical axis. Higher is better for the area under the curve. Since the best of the 20 runs was at the highest maximum depth allowed, another HPO job with a higher limit is indicated.
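The reasoning in that last sentence can be expressed as a simple check: if the best-scoring job sits at the edge of the search range, the range was probably too narrow. The sketch below uses invented results, not the numbers from the chart.

```python
def best_at_upper_bound(results, upper_bound):
    """Return True if the best-scoring job used the largest allowed value.

    results: list of (hyperparameter_value, objective_metric) pairs,
    where a higher metric is better.
    """
    best_value, _ = max(results, key=lambda pair: pair[1])
    return best_value >= upper_bound


# Invented (max_depth, AUC) pairs for illustration.
results = [(2, 0.71), (5, 0.74), (8, 0.76), (10, 0.78)]
needs_wider_range = best_at_upper_bound(results, upper_bound=10)
```

When the check returns True, as it does for this made-up data, rerunning the tuning job with a larger maximum is the natural next step.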

Amazon SageMaker and the machine learning landscape

Overall, SageMaker has significantly improved the utility of AWS for data scientists by putting everything necessary for training, validating, and deploying models in one place, and taking care of most of the scut work involved in setting up and tearing down instances for distributed training. The 11 algorithms supplied with SageMaker ought to cover 80 percent of what people want to do with machine learning and deep learning. The other 20 percent can be covered by writing TensorFlow or MXNet code or, if you have an algorithm that works for you in another framework such as PyTorch or CNTK, by providing one or two custom containers for training and inference.

At a Glance
  • Amazon SageMaker is a highly scalable machine learning and deep learning service that supports 11 algorithms of its own, plus any others you supply. Hyperparameter optimization is still in preview, and you need to do your own ETL and feature engineering.

Pros
    • Excellent, easy scalability for training and inference
    • Models provided with service are robust and perform well
    • Easy on-demand access to high-end GPUs
    • Able to use TensorFlow, MXNet and other machine learning and deep learning frameworks

Cons
    • Data must be in S3, although other AWS services can perform ETL to S3
    • Fewer models provided than comparable services, although you can use your own
    • You need to do your own feature engineering
    • Hyperparameter optimization is still in preview