13 frameworks for mastering machine learning

Venturing into machine learning? These open source tools do the heavy lifting for you

Over the past year, machine learning has gone mainstream with a bang. The “sudden” arrival of machine learning isn’t fueled by cheap cloud environments and ever more powerful GPU hardware alone. It is also due to an explosion of open source frameworks designed to abstract away the hardest parts of machine learning and make its techniques available to a broad class of developers.

Here is a baker’s dozen of machine learning frameworks, either freshly minted or newly revised within the past year. These tools caught our attention for their provenance, for bringing a novel simplicity to their problem domain, for addressing a specific challenge associated with machine learning, or for all of the above.

Apache Spark MLlib

Apache Spark may be best known for being part of the Hadoop family, but this in-memory data processing framework was born outside of Hadoop and is making a name for itself outside the Hadoop ecosystem as well. Spark has become a go-to machine learning tool, thanks to its growing library of algorithms that can be applied to in-memory data at high speed.

Previous versions of Spark bolstered support for MLlib, a major platform for math and stats users, and allowed Spark ML jobs to be suspended and resumed via the persistent pipelines feature. Spark 2.0, released in 2016, improves on the Tungsten high-speed memory management system and the new DataFrames streaming API, both of which can provide performance boosts to machine learning apps.

H2O

H2O, now in its third major revision, provides access to machine learning algorithms by way of common development environments (Python, Java, Scala, R), big data systems (Hadoop, Spark), and data sources (HDFS, S3, SQL, NoSQL). H2O is meant to be used as an end-to-end solution for gathering data, building models, and serving predictions. For instance, models can be exported as Java code, allowing predictions to be served on many platforms and in many environments.

H2O can work as a native Python library, through a Jupyter Notebook, or through the R language in RStudio. The platform also includes an open source, web-based environment called Flow, exclusive to H2O, which allows interaction with the dataset during the training process, not just before or after.

Apache Singa

“Deep learning” frameworks power heavy-duty machine-learning functions, such as natural language processing and image recognition. Singa, an Apache Incubator project, is an open source framework intended to make it easy to train deep-learning models on large volumes of data.

Singa provides a simple programming model for training deep-learning networks across a cluster of machines, and it supports many common types of training jobs: convolutional neural networks, restricted Boltzmann machines, and recurrent neural networks. Models can be trained synchronously (workers wait for one another and combine their updates at each step) or asynchronously (workers apply their updates independently), on both CPU and GPU clusters, with FPGA support coming soon. Singa also simplifies cluster setup with Apache Zookeeper.

Caffe2

Deep-learning framework Caffe is “made with expression, speed, and modularity in mind.” Originally developed in 2013 for machine vision projects, Caffe has since expanded to include other applications, such as speech and multimedia.

Speed is a major priority, so Caffe has been written entirely in C++, with CUDA acceleration support, although it can switch between CPU and GPU processing as needed. The distribution includes a set of free and open source reference models for common classification jobs, with other models created and donated by the Caffe user community.

A new iteration of Caffe backed by Facebook, called Caffe2, is currently under development for a 1.0 release. Its goals are to make it easier to perform distributed training and deploy to mobile devices, to provide support for new kinds of hardware like FPGAs, and to make use of cutting-edge features like 16-bit floating-point training.

Google TensorFlow

Much like Microsoft’s DMTK, Google TensorFlow is a machine learning framework designed to scale across multiple nodes. As with Google’s Kubernetes, it was built to solve problems internally at Google, and Google eventually elected to release it as an open source product.

TensorFlow implements what are called data flow graphs, where batches of data (“tensors”) can be processed by a series of algorithms described by a graph. The movements of the data through the system are called “flows”—hence the name. Graphs can be assembled with C++ or Python and can be processed on CPUs or GPUs. 
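
The build-a-graph-then-run-it idea can be sketched in a few lines of plain Python. This is a toy illustration only, not TensorFlow's API: the `Graph` and `Node` classes, the `build_example` helper, and the list-of-floats stand-in for a tensor are all hypothetical names invented for this sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence

Tensor = List[float]  # stand-in for a real multidimensional array


@dataclass
class Node:
    """One operation in the graph; `inputs` names the nodes it consumes."""
    name: str
    op: Callable[..., Tensor]
    inputs: Sequence[str] = field(default_factory=tuple)


class Graph:
    """Hypothetical toy graph: describe the computation first, run it later."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}

    def add(self, name: str, op: Callable[..., Tensor],
            inputs: Sequence[str] = ()) -> str:
        self.nodes[name] = Node(name, op, tuple(inputs))
        return name

    def run(self, target: str, feed: Dict[str, Tensor]) -> Tensor:
        # Fed values seed the cache; every other node is computed on demand
        # and at most once per run.
        cache: Dict[str, Tensor] = dict(feed)

        def evaluate(name: str) -> Tensor:
            if name not in cache:
                node = self.nodes[name]
                args = [evaluate(dep) for dep in node.inputs]
                cache[name] = node.op(*args)
            return cache[name]

        return evaluate(target)


def build_example() -> Graph:
    """Element-wise y = max(0, x * w + b), expressed as three graph nodes."""
    g = Graph()
    g.add("scaled", lambda x, w: [a * b for a, b in zip(x, w)], ["x", "w"])
    g.add("shifted", lambda s, b: [a + c for a, c in zip(s, b)], ["scaled", "b"])
    g.add("y", lambda s: [max(0.0, a) for a in s], ["shifted"])
    return g
```

Calling `build_example().run("y", {"x": [1.0, -2.0], "w": [3.0, 3.0], "b": [0.5, 0.5]})` pushes the fed values through the three nodes and returns `[3.5, 0.0]`. Separating the description of the graph from its execution is what lets a real framework decide where each node runs, for example on a CPU or a GPU.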

Recent updates to TensorFlow have added better compatibility with Python, improved GPU operations, opened the door to running TensorFlow on a broader variety of hardware, and expanded the library of built-in classification and regression tools.

Amazon Machine Learning

Amazon’s approach to cloud services has followed a pattern. Provide the basics, bring in a core audience that cares, let them build on top of it, then find out what they really need and deliver that.

The same could be said of Amazon’s foray into offering machine learning as a service, Amazon Machine Learning. It connects to data stored in Amazon S3, Redshift, or RDS, and can run binary classification, multiclass classification, or regression on that data to create a model. However, note that the resulting models can’t be imported or exported, and datasets for training models can’t be larger than 100GB.

Still, Amazon Machine Learning shows how machine learning is being made a practicality instead of a luxury. And for those who want to go further, or remain less tightly coupled to the Amazon cloud, Amazon’s Deep Learning AMI (Amazon Machine Image) includes many of the major deep learning frameworks, including Caffe2, CNTK, MXNet, and TensorFlow.

Microsoft Azure ML Studio

Given the sheer amount of data and computational power needed to perform machine learning, the cloud is an ideal environment for ML apps. Microsoft has outfitted Azure with its own pay-as-you-go machine learning service, Azure ML Studio, with monthly, hourly, and free-tier versions. (The company’s HowOldRobot project was created with this system.) You don’t even need an account to try out the service; you can log in anonymously and use Azure ML Studio for up to eight hours.

Azure ML Studio allows users to create and train models, then turn them into APIs that can be consumed by other services. Users of the free tier get up to 10GB of storage per account for model data, and users can connect their own Azure storage to the service for larger models. A wide range of algorithms is available, courtesy of both Microsoft and third parties.
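
To make the idea of turning a trained model into an API concrete, here is a minimal sketch that uses only Python's standard library. It says nothing about how Azure ML Studio does this internally; the fixed linear scorer (`WEIGHTS`, `BIAS`), the `predict`, `handle_body`, and `serve` helpers, and the port number are all assumptions made for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical "trained model": a fixed linear scorer standing in for
# whatever a real training step would have produced.
WEIGHTS = {"age": 0.5, "visits": 2.0}
BIAS = 1.0


def predict(features: dict) -> float:
    """Score one record; features missing from the request count as zero."""
    return BIAS + sum(
        weight * float(features.get(name, 0.0))
        for name, weight in WEIGHTS.items()
    )


def handle_body(body: bytes) -> bytes:
    """Turn a JSON request body into a JSON response body."""
    features = json.loads(body.decode("utf-8"))
    return json.dumps({"prediction": predict(features)}).encode("utf-8")


class PredictHandler(BaseHTTPRequestHandler):
    """Answers POST requests with the model's prediction as JSON."""

    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        payload = handle_body(self.rfile.read(length))
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8000) -> None:
    """Block forever serving predictions; the port is an arbitrary choice."""
    HTTPServer(("127.0.0.1", port), PredictHandler).serve_forever()
```

With `serve()` running, any service that can send an HTTP POST with a JSON body such as `{"age": 4, "visits": 1}` gets back `{"prediction": 5.0}`, without needing to know how the model was trained.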

Recent improvements include batched management of training jobs by way of the Azure Batch service, better deployment management controls, and detailed web service usage statistics.

Microsoft Distributed Machine Learning Toolkit

The more computers you have to throw at any machine learning problem, the better—but developing ML applications that run well across large numbers of machines can be tricky. Microsoft’s DMTK (Distributed Machine Learning Toolkit) framework tackles the issue of distributing various kinds of machine learning jobs across a cluster of systems.

DMTK is billed as a framework rather than a full-blown, out-of-the-box solution, so the number of algorithms included with it is small. However, you will find key machine learning libraries such as a gradient boosting framework (LightGBM) and support for a few deep learning frameworks like Torch and Theano.

The design of DMTK allows users to make the most of clusters with limited resources. For instance, each node in the cluster has a local cache, which reduces traffic with the central server node that provides parameters for the job in question.
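
The effect of a per-node cache on server traffic can be sketched with a toy parameter server. This is a generic illustration under stated assumptions, not DMTK's actual API: the `ParameterServer` and `Worker` classes, the `flush_every` setting, and the request counter are hypothetical names introduced here.

```python
from typing import Dict


class ParameterServer:
    """Hypothetical central node holding the authoritative parameters."""

    def __init__(self, params: Dict[str, float]) -> None:
        self.params = dict(params)
        self.requests = 0  # counts round trips so traffic is visible

    def pull(self, key: str) -> float:
        self.requests += 1
        return self.params[key]

    def push(self, deltas: Dict[str, float]) -> None:
        self.requests += 1
        for key, delta in deltas.items():
            self.params[key] += delta


class Worker:
    """Hypothetical cluster node that reads and writes through a local cache."""

    def __init__(self, server: ParameterServer, flush_every: int = 4) -> None:
        self.server = server
        self.flush_every = flush_every
        self.cache: Dict[str, float] = {}
        self.pending: Dict[str, float] = {}
        self.updates_since_flush = 0

    def get(self, key: str) -> float:
        # Only the first read of a key goes to the server.
        if key not in self.cache:
            self.cache[key] = self.server.pull(key)
        return self.cache[key]

    def update(self, key: str, delta: float) -> None:
        # Apply the change locally and batch it for a later push.
        self.cache[key] = self.get(key) + delta
        self.pending[key] = self.pending.get(key, 0.0) + delta
        self.updates_since_flush += 1
        if self.updates_since_flush >= self.flush_every:
            self.flush()

    def flush(self) -> None:
        if self.pending:
            self.server.push(self.pending)
        self.pending = {}
        self.updates_since_flush = 0


if __name__ == "__main__":
    server = ParameterServer({"w": 1.0})
    worker = Worker(server, flush_every=4)
    for _ in range(8):
        worker.update("w", 0.5)
    print(server.params["w"], server.requests)  # prints 5.0 3
```

In this sketch, eight updates cost three round trips (one pull and two batched pushes) rather than one or more per update. The trade-off is that the server's copy lags behind the worker's cache between flushes, and a real system with many workers would also need a policy for refreshing stale cached values.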

Microsoft Computational Network Toolkit

Hot on the heels of releasing DMTK, Microsoft unveiled yet another machine learning toolkit, the Computational Network Toolkit, or CNTK for short.

CNTK is similar to Google TensorFlow in that it lets users create neural networks by way of a directed graph. Microsoft also considers CNTK to be comparable to projects like Caffe, Theano, and Torch – except for the ability of CNTK to achieve greater speed by exploiting both multiple CPUs and multiple GPUs in parallel. Microsoft claims that running CNTK on GPU clusters on Azure allowed it to accelerate speech recognition training for Cortana by an order of magnitude.

The latest edition of the framework, CNTK 2.0, turns up the heat on TensorFlow by improving accuracy, adding a Java API for the sake of Spark compatibility, and supporting code from the Keras framework (commonly used with TensorFlow).

Apache Mahout

Mahout was originally built to allow scalable machine learning on Hadoop, long before Spark usurped that throne. But after a long period of relatively minimal activity, Mahout has been rejuvenated with new additions, such as a new environment for math, called Samsara, that allows algorithms to be run across a distributed Spark cluster. Both CPU and GPU operations are supported.

The Mahout framework has long been tied to Hadoop, but many of the algorithms under its umbrella can also run as-is outside of Hadoop. These are useful for stand-alone applications that might eventually be migrated into Hadoop or for Hadoop projects that could be spun off into their own stand-alone applications.

Veles

Veles is a distributed platform for deep-learning applications, and like TensorFlow and DMTK, it’s written in C++, although it uses Python to perform automation and coordination between nodes. Datasets can be analyzed and automatically normalized before being fed to the cluster, and a REST API allows the trained model to be used in production immediately (assuming your hardware is up to the task).

Veles goes beyond merely employing Python as glue code, as the Python-based Jupyter Notebook can be used to visualize and publish results from a Veles cluster. Samsung, which develops Veles, hopes that releasing it as open source will stimulate further development, such as ports to Windows and macOS.

mlpack 2

A C++-based machine learning library originally rolled out in 2011, mlpack is designed for “scalability, speed, and ease-of-use,” according to the library’s creators. It can be used through a set of command-line executables for quick-and-dirty “black box” operations, or through a C++ API for more sophisticated work.

Version 2 of mlpack includes many new kinds of algorithms, along with refactorings of existing algorithms to speed them up or slim them down. For example, it ditches the Boost library’s random number generator in favor of C++11’s native random functions.

One longstanding disadvantage of mlpack is the lack of bindings for any language other than C++. That means users of other languages will need a third-party library, such as the one for Python. Work has been done to add MATLAB support, but projects like mlpack tend to enjoy greater uptake when they’re directly useful in the major environments where machine learning work takes place.

Neon

Nervana, a company that builds its own deep learning hardware and software platform (now part of Intel), has offered up a deep learning framework named Neon as an open source project. Neon uses pluggable modules to allow the heavy lifting to be done on CPUs, GPUs, or Nervana’s own silicon.

Neon is written chiefly in Python, with a few pieces in C++ and assembly for speed. This makes the framework immediately available to others doing data science work in Python or in any other framework that has Python bindings.

Many standard deep learning models, such as LSTM, AlexNet, and GoogLeNet, are available as pre-trained models for Neon. The latest release, Neon 2.0, adds Intel’s Math Kernel Library to accelerate performance on CPUs.

Marvin

Another relatively recent production, the Marvin neural network framework, is a product of the Princeton Vision Group. Marvin was “born to be hacked,” as its creators explain in the documentation for the project, which relies only on a few files written in C++ and the CUDA GPU framework. Despite the deliberately minimal code, the project comes with a number of pretrained models that can be reused with proper citation and, like the project’s own code, improved through pull requests.