After spending decades in the shadows as a specialty discipline, machine learning is suddenly front and center as a business tool. The hard part, though, is making it useful, especially to the developers and budding data scientists who are being tasked with the job.
To that end, we rounded up some of the most common and useful open source machine learning tools we've spotted in the wild.
For Python: Data scientists have jumped on Python as a more open-ended alternative to analytical languages like R, and many employers looking to add big-data expertise to their teams are listing Python as a desired skill. As a result, plenty of machine learning libraries have shown up in Python's ever-expanding software ecosystem.
Of the bunch, the top choice is scikit-learn. It's loaded with algorithms and modules, is widely appreciated on GitHub (almost 2,000 forks and counting), and has a variety of big-name testimonials to its name. Another one with a sizable following is PyBrain, which is designed to be easy to work with while also providing access to some powerful tools. As the name implies, it's focused on the likes of neural networks and unsupervised learning, and it provides a mechanism for training and refining algorithms.
For Go: Google's systems language, designed for parallelism, seems like an ideal environment for writing machine learning libraries. A slew of smaller, more specific libraries pepper the landscape, but a few general ones stand out. The most notable, GoLearn, is described by its creators as a "batteries included" machine-learning library, and it has tools for filtering, classification, and regression analysis. A much smaller and more basic library, mlgo, implements only a small number of algorithms at this time, but more are planned for the future.
For Java on Hadoop: Mahout (which means "elephant rider" in Hindi) bundles several common machine learning methodologies for use in everyone's favorite big data framework. The package is built around algorithms rather than methodologies, so some understanding of the algorithms is required. That said, it isn't hard to see how the pieces fit together if you're diligent; a user-based recommendation system, for instance, can be done in a few lines of code.
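To give a feel for what "user-based recommendation" means, here is a minimal sketch of the idea in plain Python. It is not Mahout's API and makes no attempt to run on Hadoop; the toy ratings, the function names (cosine_similarity, recommend), and the choice of cosine similarity over co-rated items are all assumptions made for illustration. The approach: find the users whose ratings most resemble yours, then score the items you haven't rated by a similarity-weighted average of their ratings.

```python
from math import sqrt

# Hypothetical toy data: user -> {item: rating}. Not from any real dataset.
RATINGS = {
    "alice": {"a": 5.0, "b": 3.0, "c": 4.0},
    "bob": {"a": 5.0, "b": 3.0, "c": 4.0, "d": 5.0},
    "carol": {"a": 1.0, "b": 5.0, "e": 4.0},
}


def cosine_similarity(u, v):
    """Cosine similarity computed over the items both users have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = sqrt(sum(v[i] ** 2 for i in shared))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def recommend(user, ratings, k=2, n=3):
    """Return up to n (item, score) pairs for items the user has not rated.

    Scores are similarity-weighted averages of the ratings given by the
    k most similar users (the "neighborhood").
    """
    own = ratings[user]
    # Rank every other user by similarity and keep the k closest.
    neighbors = sorted(
        (
            (cosine_similarity(own, other), name)
            for name, other in ratings.items()
            if name != user
        ),
        reverse=True,
    )[:k]

    totals, weights = {}, {}
    for sim, name in neighbors:
        if sim <= 0:
            continue  # ignore users with no overlap or no positive similarity
        for item, rating in ratings[name].items():
            if item in own:
                continue  # only score items the user hasn't seen
            totals[item] = totals.get(item, 0.0) + sim * rating
            weights[item] = weights.get(item, 0.0) + sim

    scored = [(totals[i] / weights[i], i) for i in totals]
    # Highest score first; break ties alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(item, score) for score, item in scored[:n]]


if __name__ == "__main__":
    for item, score in recommend("alice", RATINGS):
        print(item, round(score, 2))
```

With the toy data above, alice's closest neighbor is bob (identical ratings on the items they share), so bob's item "d" ranks ahead of carol's item "e". A production system would add the pieces this sketch leaves out, such as mean-centering ratings, handling sparse overlap, and distributing the work across a cluster.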
Another Hadoop-based machine learning project, Cloudera's Oryx, is meant to build on Mahout's work by delivering real-time streaming results rather than working on batch jobs. Unfortunately, it's still in the early stages -- a project rather than a product -- but it deserves a close eye as it evolves.
For Java: Aside from the aforementioned Mahout, which focuses on Hadoop, a number of other machine learning libraries for Java are in wide use. Weka, created by the University of Waikato in New Zealand, is a workbench-like app that adds visualizations and data-mining capabilities to the usual mix of algorithms. For people who want a front end for their work and plan on doing a good part of it in Java to begin with, Weka might be the best place to start. A more conventional library, Java-ML, is also available, although it's meant for people already comfortable working with both Java and machine learning.