11 open source tools to make the most of machine learning

Tap the predictive power of machine learning with these diverse, easy-to-implement libraries and frameworks

Spam filtering, face recognition, recommendation engines—when you have a large data set on which you’d like to perform predictive analysis or pattern recognition, machine learning is the way to go. The proliferation of free open source software has made machine learning easier to implement both on single machines and at scale, and in most popular programming languages. These 11 open source tools include libraries for the likes of Python, R, C++, Java, Scala, Clojure, JavaScript, and Go.

Scikit-learn Developers

Scikit-learn

Python has become a go-to programming language for math, science, and statistics due to its ease of adoption and the breadth of libraries available for nearly any application. Scikit-learn leverages this breadth by building on top of several existing Python packages—NumPy, SciPy, and Matplotlib—for math and science work. The resulting libraries can be used for interactive “workbench” applications or embedded into other software and reused. The kit is available under a BSD license, so it’s fully open and reusable.
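To give a feel for the "workbench" style of use, here is a minimal sketch that trains and scores a classifier on the Iris data set bundled with scikit-learn. The choice of logistic regression, the 75/25 split, and the helper name train_iris_classifier are illustrative assumptions, not anything prescribed by the project.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def train_iris_classifier(seed=0):
    """Fit a classifier on Iris and return it with its held-out accuracy."""
    # Features and labels come back as NumPy arrays.
    X, y = load_iris(return_X_y=True)

    # Hold out a quarter of the samples for evaluation (an arbitrary choice).
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )

    # max_iter is raised so the solver converges on unscaled features.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)

    # score() reports mean accuracy on the held-out samples.
    return model, model.score(X_test, y_test)


if __name__ == "__main__":
    model, accuracy = train_iris_classifier()
    print(f"held-out accuracy: {accuracy:.2f}")
```

Most scikit-learn estimators follow the same fit/predict/score pattern, so swapping in a different model is usually a one-line change.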

Project: scikit-learn
GitHub: https://github.com/scikit-learn/scikit-learn

The Shogun Team

Shogun

Venerable Shogun was created in 1999 and written in C++, but can be used with Java, Python, C#, Ruby, R, Lua, Octave, and Matlab. The latest version, 6.0.0, adds native support for Microsoft Windows and the Scala language.

Though popular and wide-ranging, Shogun has competition. Another C++-based machine learning library, Mlpack, has been around only since 2011, but claims to be faster and easier to work with than competing libraries, thanks to a more tightly integrated API.

Project: Shogun
GitHub: https://github.com/shogun-toolbox/shogun

César Souza

Accord.Net Framework

Accord, a machine learning and signal processing framework for .Net, is an extension of a previous project in the same vein, AForge.net. Accord includes a set of libraries for processing audio signals and image streams (such as videos). Its algorithms for vision processing can be used for tasks such as face detection, for stitching together images, or for tracking moving objects.

Accord also includes libraries that provide a more conventional gamut of machine learning functions, from neural networks to decision-tree systems.

Project: Accord Framework/AForge.net
GitHub: https://github.com/accord-net/framework/

Apache Software Foundation

Apache Mahout

Apache Mahout has long been tied to Hadoop, but many of the algorithms under its umbrella can also run outside Hadoop. They’re useful for stand-alone applications that might eventually be migrated into Hadoop or for Hadoop projects that could be spun off into their own stand-alone applications. The last few versions have bolstered support for the high-performance Spark framework and added support for the ViennaCL library for GPU-accelerated linear algebra.

Project: Mahout

Apache Software Foundation

Spark MLlib

The machine learning library for Apache Spark and Apache Hadoop, MLlib boasts many common algorithms and useful data types, designed to run at speed and scale. Although Java is the primary language for working in MLlib, Python users can connect MLlib with the NumPy library, Scala users can write code against MLlib, and R users can plug into Spark as of version 1.5. 

Another project, MLbase, builds on top of MLlib to make it easier to derive results. Rather than write code, users make queries by way of a declarative language à la SQL.

Project: MLlib

H2O.ai

H2O

H2O’s algorithms are geared for business processes—fraud or trend predictions, for instance—rather than, say, image analysis. H2O can interact in a stand-alone fashion with HDFS stores, on top of YARN, in MapReduce, or directly in an Amazon EC2 instance. Hadoop mavens can use Java to interact with H2O, but the framework also provides bindings for Python, R, and Scala, allowing you to interact with all of the libraries available on those platforms as well.

Project: H2O
GitHub: https://github.com/0xdata/h2o

Cloudera

Cloudera Oryx

Oryx, courtesy of the creators of the Cloudera Hadoop distribution, uses Spark and the Kafka stream processing framework to run machine learning models on real-time data. Oryx provides a way to build projects that require decisions in the moment, like recommendation engines or live anomaly detection, that are informed by both new and historical data. Version 2.0 is a near-complete redesign of the project, with its components loosely coupled in a lambda architecture. New algorithms, and new abstractions for those algorithms (e.g., for hyperparameter selection), can be added at any time.
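The lambda architecture mentioned above can be sketched in a few lines of plain Python. This is a conceptual illustration only, not Oryx's API: the class names BatchLayer, SpeedLayer, and ServingLayer, and the toy "most popular items" model, are assumptions made for the example. The idea is that a batch layer periodically rebuilds a model from historical data, a speed layer tracks events that arrived since the last rebuild, and a serving layer merges the two to answer queries.

```python
from collections import Counter


class BatchLayer:
    """Rebuilds a model from the full event history (hypothetical)."""

    def build(self, history):
        # The toy "model" is a count of interactions per item.
        return Counter(item for _user, item in history)


class SpeedLayer:
    """Tracks events seen since the last batch build (hypothetical)."""

    def __init__(self):
        self.delta = Counter()

    def update(self, event):
        _user, item = event
        self.delta[item] += 1

    def reset(self):
        # Called after a batch rebuild has absorbed the recent events.
        self.delta.clear()


class ServingLayer:
    """Answers queries from the batch model plus the recent delta."""

    def __init__(self, batch_model, speed_layer):
        self.batch_model = batch_model
        self.speed_layer = speed_layer

    def top_items(self, n=3):
        # Merge historical and fresh counts at query time.
        merged = self.batch_model + self.speed_layer.delta
        return [item for item, _count in merged.most_common(n)]


if __name__ == "__main__":
    history = [("u1", "a"), ("u2", "a"), ("u3", "b")]
    speed = SpeedLayer()
    serving = ServingLayer(BatchLayer().build(history), speed)
    print(serving.top_items(2))   # ranking from historical data only

    for user in ("u4", "u5", "u6"):
        speed.update((user, "c"))  # new events arrive in real time
    print(serving.top_items(2))   # ranking now reflects the new events
```

Because the layers only share data, not code, each can be replaced or scaled independently, which is the point of keeping the components loosely coupled.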

Project: Cloudera Oryx
GitHub: https://github.com/cloudera/oryx

Stephen Whitworth

GoLearn

GoLearn, a machine learning library for Google’s Go language, was created with the twin goals of simplicity and customizability, according to developer Stephen Whitworth. The simplicity lies in the way data is loaded and handled in the library, which is patterned after SciPy and R. The customizability lies in how some of the data structures can be easily extended in an application. Whitworth has also created a Go wrapper for the Vowpal Wabbit library, one of the libraries found in the Shogun toolbox.

Project: GoLearn
GitHub: https://github.com/sjwhitworth/golearn

University of Waikato

Weka

Weka is a set of Java machine learning algorithms engineered specifically for data mining. This GNU GPLv3-licensed collection has a package system to extend its functionality, with both official and unofficial packages available. Weka even comes with a book to explain both the software and the techniques used.

While Weka isn’t aimed specifically at Hadoop users, the most recent versions can be used with Hadoop thanks to a set of wrappers. Note that Weka doesn’t yet support Spark, only MapReduce. Clojure users can leverage Weka via the Clj-ml library.

Project: Weka

Google

Deeplearn.js

Deeplearn.js, a project for deep learning in the web browser, comes by way of Google. Neural network models can be trained directly in any modern browser, without additional client-side software. Deeplearn.js can also perform GPU-accelerated computation by way of the WebGL API, so performance is not limited to the system’s CPU. The functions available in the project are patterned after Google’s TensorFlow, making it easy for users of that project to get started with this one.

Project: Deeplearn.js

Andrej Karpathy

ConvNetJS

As the name implies, ConvNetJS is a JavaScript library for neural network machine learning, facilitating use of the browser as a data workbench. An NPM version is also available for those using Node.js, and the library is designed to make proper use of JavaScript’s asynchronicity. For example, training operations can be given a callback to execute once they complete. Plenty of demo examples are included, too.

Project: ConvNetJS
GitHub: https://github.com/karpathy/convnetjs