Review: AWS AI and Machine Learning stacks up, and up
Amazon Web Services provides an impressively broad and deep set of machine learning and AI services, rivaling Google Cloud and Microsoft Azure.
SageMaker Debugger supports the most common machine learning frameworks including TensorFlow, PyTorch, Apache MXNet, Keras, and XGBoost. SageMaker’s built-in containers for these frameworks come pre-installed with SageMaker Debugger, enabling you to monitor, profile, and debug your training scripts. You can also use SageMaker Debugger with custom training containers.
One way to monitor your training runs is to view the SageMaker Debugger insights in SageMaker Studio while your training jobs are running. This particular run triggered eight rules, for which Debugger supplied suggestions.
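With the SageMaker Python SDK, attaching built-in Debugger rules to a training job means passing a rules list to the estimator. Here's a minimal sketch, assuming a PyTorch training script; the entry point, role, and S3 path are hypothetical:

```python
from sagemaker.debugger import Rule, rule_configs
from sagemaker.pytorch import PyTorch

# Two of Debugger's built-in rules; each runs alongside the training job
# and fires when its condition is detected.
rules = [
    Rule.sagemaker(rule_configs.vanishing_gradient()),
    Rule.sagemaker(rule_configs.overfit()),
]

estimator = PyTorch(
    entry_point="train.py",    # hypothetical training script
    role=role,                 # assumes an IAM role defined elsewhere
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.8.1",
    py_version="py36",
    rules=rules,
)
estimator.fit("s3://my-bucket/training-data")  # hypothetical S3 path
```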
Amazon SageMaker Model Monitor
While SageMaker Debugger monitors model training, SageMaker Model Monitor monitors an endpoint's model inference, and lets you know if and when it sees model drift (accuracy dropping over time) or data drift (divergence between the data used to train a model and the data it sees during inference). Model Monitor is also integrated with Amazon SageMaker Clarify to improve visibility into potential bias.
SageMaker Model Monitor can detect model and concept drift over time. It can also detect bias drift over time, as shown in the time series graph at the bottom of the screen.
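With the SageMaker Python SDK, setting up monitoring means baselining your training data and then scheduling periodic checks against the endpoint. A minimal sketch, with hypothetical names and paths:

```python
from sagemaker.model_monitor import CronExpressionGenerator, DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

monitor = DefaultModelMonitor(
    role=role,  # assumes an IAM role defined elsewhere
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

# Compute statistics and constraints from the training data; live
# traffic captured from the endpoint is compared against these.
monitor.suggest_baseline(
    baseline_dataset="s3://my-bucket/train.csv",  # hypothetical path
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://my-bucket/baseline",      # hypothetical path
)

# Check the endpoint's captured traffic against the baseline every hour.
monitor.create_monitoring_schedule(
    monitor_schedule_name="my-endpoint-monitor",
    endpoint_input="my-endpoint",                 # hypothetical endpoint name
    output_s3_uri="s3://my-bucket/reports",       # hypothetical path
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)
```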
Amazon SageMaker Distributed Training
SageMaker now supports two kinds of distributed training of its own, in addition to the framework-specific APIs for TensorFlow (Horovod) and PyTorch (DDP). Amazon claims up to a 40% reduction in distributed training time, but I’m not really sure how they came up with that number.
The two distributed training mechanisms address different problems. Data parallelism addresses excessively large amounts of training data by distributing mini-batches across multiple workers and averaging the gradients across workers after each mini-batch (the all-reduce step). You should try training on bigger instances with more GPUs and larger GPU memory before resorting to training on multiple VM instances.
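Here's a minimal sketch of what turning on SageMaker's data parallel library looks like with the SageMaker Python SDK, assuming a PyTorch script; the entry point, role, and S3 path are hypothetical:

```python
from sagemaker.pytorch import PyTorch

# Two ml.p3.16xlarge instances, eight GPUs each; the SageMaker data
# parallel library handles the all-reduce across all 16 GPUs.
estimator = PyTorch(
    entry_point="train.py",    # hypothetical training script
    role=role,                 # assumes an IAM role defined elsewhere
    instance_count=2,
    instance_type="ml.p3.16xlarge",
    framework_version="1.8.1",
    py_version="py36",
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
estimator.fit("s3://my-bucket/training-data")  # hypothetical S3 path
```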
Model parallelism addresses large models that don’t fit into a node’s memory in one piece by dividing the neural network up into layers and distributing the layers across nodes. Partitioning a neural network by hand often takes weeks, but SageMaker can split your model in seconds by profiling it with SageMaker Debugger and finding the most efficient way to partition it across GPUs. One of the tricks that SageMaker uses for model parallelism is to construct an interleaved pipeline, prioritizing backward execution of mini-batches whenever possible.
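The model parallel library is enabled the same way, through the estimator's distribution argument. The partition count, microbatch count, and pipeline setting below are illustrative values, not recommendations:

```python
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",  # hypothetical training script
    role=role,               # assumes an IAM role defined elsewhere
    instance_count=1,
    instance_type="ml.p3.16xlarge",
    framework_version="1.8.1",
    py_version="py36",
    distribution={
        "smdistributed": {
            "modelparallel": {
                "enabled": True,
                "parameters": {
                    "partitions": 2,           # split the model across 2 GPUs
                    "microbatches": 4,         # slice mini-batches for pipelining
                    "pipeline": "interleaved", # the interleaved pipeline described above
                },
            }
        },
        "mpi": {"enabled": True, "processes_per_host": 2},
    },
)
```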
Which mechanism should you use? If you can, use data parallelism. If you still can’t fit the model into the memory of the biggest GPU available to you after trying all the tricks, then try model parallelism. Note that model parallelism saves memory for large models, enabling you to train using batch sizes that previously did not fit in memory.
Amazon SageMaker Pipelines
MLOps is becoming a big deal, finally. (It seems to me that data scientists are at least 10 years behind software developers when it comes to operations.) Amazon claims that the new SageMaker Pipelines product is the first purpose-built, easy-to-use continuous integration and continuous delivery (CI/CD) service for machine learning. I suspect that both Google Cloud and Microsoft Azure disagree with that claim.
SageMaker Pipelines helps you automate different steps of the machine learning workflow, including data loading, data transformation, training and tuning, and deployment. According to AWS, with SageMaker Pipelines you can build dozens of machine learning models a week and manage massive volumes of data, thousands of training experiments, and hundreds of different model versions. You can share and re-use workflows to recreate or optimize models, helping you scale machine learning throughout your organization.
You can view SageMaker Pipeline graphs in SageMaker Studio, after building them with the SageMaker Python SDK.
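A minimal pipeline, assuming an estimator defined elsewhere (the step name, bucket, and role are hypothetical), looks something like this:

```python
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import TrainingStep

# 'estimator' is a SageMaker Estimator assumed to be defined elsewhere.
step_train = TrainingStep(
    name="TrainModel",
    estimator=estimator,
    inputs={"train": TrainingInput(s3_data="s3://my-bucket/training-data")},
)

pipeline = Pipeline(name="MyPipeline", steps=[step_train])
pipeline.upsert(role_arn=role)  # register or update the pipeline definition
execution = pipeline.start()    # the resulting graph shows up in SageMaker Studio
```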
Amazon SageMaker Edge Manager
Edge computing is information processing that happens near where the data is produced, connected back to the cloud, often through an edge gateway. You hear about edge computing mostly in the context of the Internet of Things. There has been a huge boom in the number and capabilities of edge devices, such as the Nvidia Jetson family of modules for embedded systems, which include GPUs and can run multiple neural networks in parallel while reading input from multiple sensors.
Amazon SageMaker Edge Manager allows you to optimize, secure, monitor, and maintain machine learning models on fleets of smart cameras, robots, personal computers, and mobile devices. It provides a software agent that runs on each edge device and executes machine learning models optimized with SageMaker Neo. The agent also collects prediction data and sends a sample of the data to the cloud for monitoring, labeling, and retraining so you can keep models accurate over time. All data can be viewed in the SageMaker Edge Manager dashboard, which reports on the operation of deployed models across your fleet of edge devices.
SageMaker Edge Manager allows you to optimize and package trained models using different frameworks such as Darknet, Keras, Apache MXNet, PyTorch, TensorFlow, TensorFlow Lite, ONNX, and XGBoost for inference on Android, iOS, Linux, and Windows-based machines. It supports gRPC, an open source remote procedure call framework, which allows you to integrate SageMaker Edge Manager with your existing edge applications through APIs.
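As a rough sketch of the workflow, packaging a Neo-compiled model for an edge fleet is a single boto3 call; the job names, model name, role, and bucket here are hypothetical:

```python
import boto3

sm = boto3.client("sagemaker")

# Package a model already compiled by SageMaker Neo so the Edge
# Manager agent can deploy and run it on edge devices.
sm.create_edge_packaging_job(
    EdgePackagingJobName="my-edge-packaging-job",
    CompilationJobName="my-neo-compilation-job",  # prior Neo compilation job
    ModelName="my-model",
    ModelVersion="1.0",
    RoleArn=role_arn,  # assumes an IAM role ARN defined elsewhere
    OutputConfig={"S3OutputLocation": "s3://my-bucket/edge-packages/"},
)
```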
AWS AI Services
While SageMaker is primarily for training, deploying, and managing your own models, the AWS AI services are pre-trained and ready to use. In general, it’s more efficient to use a pre-trained service if it does what you need.
Amazon’s non-industry-specific AI services include Amazon Kendra (enterprise search), Amazon Personalize (recommendations), AWS Contact Center Intelligence, Amazon Comprehend (text analytics), Amazon Textract (document analysis), Amazon Translate, Amazon Lookout for Metrics (anomaly detection), Amazon Forecast (demand forecasting), Amazon Fraud Detector, Amazon Lookout for Vision (quality inspection), AWS Panorama (vision at the edge), Amazon Rekognition (image and video analysis), Amazon Polly (text to speech), Amazon Transcribe (speech to text), Amazon Lex (chatbots), Amazon DevOps Guru, Amazon CodeGuru Reviewer (for Java and Python code), and Amazon CodeGuru Profiler (runtime behavior of code).
Amazon Kendra
Amazon Kendra is a managed enterprise search service that enables your users to search unstructured data using natural language. It returns specific answers to three types of questions: factoids (who, what, where, or when), descriptions (how-to), and (with less specificity) keywords.
By understanding the question asked, Amazon Kendra can return a specific answer instead of everything related to the keywords used.
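Querying an existing Kendra index with boto3 is straightforward; as a sketch, with a hypothetical index ID:

```python
import boto3

kendra = boto3.client("kendra")

# Ask a natural-language question against an existing index. Kendra
# returns typed results: ANSWER, QUESTION_ANSWER, and DOCUMENT.
response = kendra.query(
    IndexId=index_id,  # hypothetical index ID
    QueryText="Who is the CEO?",
)
for result in response["ResultItems"]:
    print(result["Type"], result["DocumentExcerpt"]["Text"])
```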
Amazon Personalize
The classic example of a personalized home page is, of course, Amazon.com, where you will always see products that Amazon thinks you might buy based on your browsing and buying history. Amazon Personalize enables developers to build applications with the same machine learning technology used by Amazon.com for real-time personalized recommendations, without requiring you to know anything about machine learning.
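Once a Personalize campaign has been trained and deployed, fetching recommendations at runtime is one boto3 call; the campaign ARN and user ID below are hypothetical:

```python
import boto3

runtime = boto3.client("personalize-runtime")

# Real-time recommendations for one user from a deployed campaign.
response = runtime.get_recommendations(
    campaignArn=campaign_arn,  # hypothetical deployed campaign
    userId="user-123",
    numResults=10,
)
for item in response["itemList"]:
    print(item["itemId"])
```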
AWS Contact Center Intelligence and Amazon Connect
Amazon Connect is the omni-channel contact center solution that Amazon built for itself a decade ago. It is appropriate if your organization doesn’t yet have a contact center. You need to contact AWS sales for pricing, but you can create an instance and connect it to a telephone number yourself.
AWS Contact Center Intelligence is an add-on to existing contact centers that “offers a variety of ways to quickly and cost effectively add intelligence.” Contact Center Intelligence is available through AWS partners.
Amazon Comprehend
Amazon Comprehend is a managed, pay-as-you-go natural language processing (NLP) service that uses machine learning to find insights and relationships in text. No ML experience is required. Services include Key Phrase Extraction, Sentiment Analysis, Entity Recognition, Language Detection, PII Detection, Event Detection, and Syntax Analysis. Each service is a separate API call.
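For example, here's what a few of those calls look like with boto3; the sample text is made up:

```python
import boto3

comprehend = boto3.client("comprehend")
text = "I love my new Kindle. The screen is easy on the eyes."

# Each Comprehend capability is its own API call.
sentiment = comprehend.detect_sentiment(Text=text, LanguageCode="en")
print(sentiment["Sentiment"])  # e.g., POSITIVE

entities = comprehend.detect_entities(Text=text, LanguageCode="en")
for entity in entities["Entities"]:
    print(entity["Type"], entity["Text"])

phrases = comprehend.detect_key_phrases(Text=text, LanguageCode="en")
for phrase in phrases["KeyPhrases"]:
    print(phrase["Text"])
```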
Comprehend Custom can train a custom NLP model to categorize text and extract custom entities. Topic modeling identifies relevant terms or topics from a collection of documents stored in Amazon S3. There is also a medical version of Comprehend; see the industry-specific services below.
Amazon Textract
Amazon Textract is a managed document processing service with two APIs, Detect Document Text and Analyze Document. It goes beyond optical character recognition (OCR) to identify, understand, and extract data from forms and tables. Textract uses machine learning to read and process any type of document, accurately extracting printed text, handwriting, forms, tables, and other data without the need for manual effort or custom code.
The Analyze Document API of the Amazon Textract service not only extracts raw text from an image, but also recognizes forms and tables (see the tabs at the top of the right-hand pane).
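Calling Analyze Document from boto3 looks like the sketch below; the bucket and file name are hypothetical:

```python
import boto3

textract = boto3.client("textract")

# Analyze a document stored in S3; requesting TABLES and FORMS goes
# beyond plain OCR text detection.
response = textract.analyze_document(
    Document={"S3Object": {"Bucket": "my-bucket", "Name": "invoice.png"}},
    FeatureTypes=["TABLES", "FORMS"],
)
for block in response["Blocks"]:
    if block["BlockType"] == "LINE":
        print(block["Text"])
```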
Amazon Translate
Amazon Translate is a neural machine translation service. Neural machine translations are typically much better than statistical and rule-based translations. Translate handles 71 languages and variants, roughly as many as Microsoft Azure Translator, but not as many as Google Translate.
You can customize translations with custom terminology and parallel data. Custom terminology, which is useful for getting brand names and technical terms right, requires you to attach a CSV or TMX file to your account, and specify the custom term as part of your translation request. There’s no extra charge for custom terminology.
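As a sketch of that flow with boto3 (the terminology file and its name are hypothetical), you import the terminology once and then reference it by name in each translation request:

```python
import boto3

translate = boto3.client("translate")

# One-time setup: import a CSV terminology file into the account.
with open("brand-terms.csv", "rb") as f:
    translate.import_terminology(
        Name="brand-terms",
        MergeStrategy="OVERWRITE",
        TerminologyData={"File": f.read(), "Format": "CSV"},
    )

# Per-request: apply the custom terminology by name.
result = translate.translate_text(
    Text="Amazon Translate is a neural machine translation service.",
    SourceLanguageCode="en",
    TargetLanguageCode="ru",
    TerminologyNames=["brand-terms"],
)
print(result["TranslatedText"])
```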
Parallel data (Active Custom Translation), which is useful for influencing the voice and style of the translation, requires you to upload full translations in CSV, TSV, or TMX format to an S3 bucket, and call a special API that’s four times as expensive as standard translations.
The maximum document size for synchronous translations is 5K UTF-8 characters, but you can divide a large document into sentences to work around this limit. The maximum document size for asynchronous translations is 20 megabytes, but there are other limits as well.
The phrase “Good fresh Russian black bread” is a textbook test of Russian adjective endings. Amazon Translate passed.
Amazon Lookout for Metrics (Preview)
Amazon Lookout for Metrics uses machine learning to automatically detect and diagnose anomalies (outliers from the norm) in business and operational time series data. It connects to Amazon S3, Amazon Redshift, and Amazon Relational Database Service (RDS), as well as third-party SaaS applications, such as Salesforce, ServiceNow, Zendesk, and Marketo.
Lookout for Metrics automatically inspects and prepares the data from these sources and builds a custom machine learning model, informed by over 20 years of experience at Amazon, to detect anomalies. You can also provide feedback on detected anomalies to tune the results and improve accuracy over time.
Lookout for Metrics allows you to diagnose detected anomalies by grouping together anomalies that are related to the same event and sending an alert that includes a summary of the potential root cause. It also ranks anomalies in order of severity.
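The feedback loop is exposed through the boto3 lookoutmetrics client. As a sketch (the detector ARN, anomaly group ID, and time series ID are hypothetical):

```python
import boto3

lookout = boto3.client("lookoutmetrics")

# Confirm a flagged time series as a true anomaly (or reject it) so
# the detector can tune its results over time.
lookout.put_feedback(
    AnomalyDetectorArn=detector_arn,
    AnomalyGroupTimeSeriesFeedback={
        "AnomalyGroupId": anomaly_group_id,
        "TimeSeriesId": time_series_id,
        "IsAnomaly": True,
    },
)
```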
Amazon Forecast
Amazon Forecast is a managed service that uses automated machine learning to turn historical time series and related data into forecasts. Forecast includes algorithms that are based on over 20 years of forecasting experience at Amazon.com.
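Once a predictor has been trained and a forecast generated, querying it for a single item is simple with boto3; the forecast ARN and item ID here are hypothetical, and the response includes p10, p50, and p90 quantiles:

```python
import boto3

forecastquery = boto3.client("forecastquery")

# Query an existing forecast for one item.
response = forecastquery.query_forecast(
    ForecastArn=forecast_arn,  # hypothetical forecast ARN
    Filters={"item_id": "item-123"},
)
# Print the median (p50) prediction for each timestamp.
for point in response["Forecast"]["Predictions"]["p50"]:
    print(point["Timestamp"], point["Value"])
```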
Amazon claims that its forecasts are up to 50% more accurate than ones based strictly on a single historical time series, when external factors are significant, as is often the case; that’s a believable number. Amazon also supplies public weather data to supplement your own data.
Another claim is that Amazon Forecast reduces forecasting time from months to hours. “Months” is an exaggeration, unless Amazon assumes that the forecasters don’t know statistics or machine learning at all and have to learn it from scratch.