Enjoy machine learning with Mahout on Hadoop

You can use the Hadoop ecosystem to manage your data. Now put that data to good use and apply machine learning via Mahout

"Mahout" is a Hindi term for a person who rides an elephant. The elephant, in this case, is Hadoop -- and Mahout is one of the many projects that can sit on top of Hadoop, although you do not always need MapReduce to run it.

Mahout puts powerful mathematical tools in the hands of the mere mortal developers who write the InterWebs. It's a package of implementations of the most popular and important machine-learning algorithms, with the majority of the implementations designed specifically to use Hadoop to enable scalable processing of huge data sets. Some algorithms are available only in a nonparallelizable "serial" form due to the nature of the algorithm, but all can take advantage of HDFS for convenient access to data in your Hadoop processing pipeline.


Machine learning is probably the most practical subset of artificial intelligence (AI), focusing on probabilistic and statistical learning techniques. For all you AI geeks, here are some of the machine-learning algorithms included with Mahout: K-means clustering, fuzzy K-means clustering, latent Dirichlet allocation, singular value decomposition, logistic regression, naive Bayes, and random forests. Mahout also features higher-level abstractions for generating "recommendations" (à la popular e-commerce sites or social networks).

I know, when someone starts talking machine learning, AI, and Tanimoto coefficients you probably make popcorn and perk up, right? Me neither. Oddly, despite the complexity of the math, Mahout has an easy-to-use API. Here's a taste:

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

// Load our preference data from a file
DataModel model = new FileDataModel(new File("data.txt"));

// Score item-to-item similarity with the log-likelihood ratio
ItemSimilarity sim = new LogLikelihoodSimilarity(model);
GenericItemBasedRecommender r = new GenericItemBasedRecommender(model, sim);

// Walk every item ID and fetch its 10 most similar items
LongPrimitiveIterator items = model.getItemIDs();
while (items.hasNext()) {
    long itemId = items.nextLong();
    List<RecommendedItem> recommendations = r.mostSimilarItems(itemId, 10);
    // do something with these recommendations
}

What this little snippet does is load a data file, cycle through the items, then get 10 recommended items based on their similarity. This is a common e-commerce task. However, just because two items are similar doesn't mean I want them both. In fact, in many cases I probably don't want to buy two similar items. I mean, I recently bought a bike -- I don't want the most similar item, which would be another bike. However, other users who bought bikes also bought tire pumps, so Mahout offers user-based recommenders as well.
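The user-based flavor follows the same pattern, with the extra step of building a "neighborhood" of similar users. Here's a minimal sketch using class names from Mahout's Taste API (the file name "data.txt", the user ID 42, and the neighborhood size of 10 are placeholder values, not anything Mahout requires):

```java
// Recommend items that similar users liked, rather than items similar to an item
DataModel model = new FileDataModel(new File("data.txt"));
UserSimilarity similarity = new PearsonCorrelationSimilarity(model);

// Consider only the 10 users most similar to the target user
UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

// Top 5 recommendations for user 42
List<RecommendedItem> recommendations = recommender.recommend(42L, 5);
```

Swapping the similarity measure -- Pearson correlation here, log-likelihood in the earlier snippet -- is a one-line change, which is much of the API's appeal.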

Both examples are very simple recommenders, and Mahout offers more advanced recommenders that take in more than a few factors and can balance user tastes against product features. None of these require advanced distributed computing, but Mahout has other algorithms that do.
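Under all of these recommenders sits some similarity measure, like the Tanimoto coefficient mentioned above. As a rough, self-contained sketch of the idea (not Mahout's actual implementation), here's the Tanimoto (Jaccard) coefficient computed over two hypothetical users' purchase histories:

```java
import java.util.HashSet;
import java.util.Set;

public class Tanimoto {
    // Tanimoto (Jaccard) coefficient: size of intersection / size of union
    public static double tanimoto(Set<Long> a, Set<Long> b) {
        Set<Long> both = new HashSet<>(a);
        both.retainAll(b); // item IDs the two users have in common
        int union = a.size() + b.size() - both.size();
        return union == 0 ? 0.0 : (double) both.size() / union;
    }

    public static void main(String[] args) {
        Set<Long> alice = Set.of(101L, 102L, 103L);       // item IDs Alice bought
        Set<Long> bob   = Set.of(101L, 103L, 104L, 105L); // item IDs Bob bought
        System.out.println(tanimoto(alice, bob)); // 2 shared out of 5 distinct = 0.4
    }
}
```

A score of 1.0 means identical purchase histories; 0.0 means nothing in common.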

Beyond recommendations

Mahout is far more than a fancy e-commerce API. Its other algorithms make predictions and classifications (such as the hidden Markov models that power much of the speech and language recognition on the Internet). It can even help you find clusters, that is, group things -- like cells of people, so you can send them gift baskets to a single address.
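Clustering is easy to picture with a toy example. The sketch below is plain two-centroid K-means on one-dimensional data -- not Mahout's distributed implementation, and the naive initialization is for illustration only -- but the assign-then-update loop is the same idea:

```java
import java.util.Arrays;

public class KMeans1D {
    // Two-centroid K-means on 1-D points; returns each point's cluster index (0 or 1)
    public static int[] cluster(double[] points, int iterations) {
        double[] centroids = {points[0], points[points.length - 1]}; // naive init
        int[] assignment = new int[points.length];
        for (int iter = 0; iter < iterations; iter++) {
            // Assignment step: attach each point to its nearest centroid
            for (int i = 0; i < points.length; i++) {
                assignment[i] =
                    Math.abs(points[i] - centroids[0]) <= Math.abs(points[i] - centroids[1]) ? 0 : 1;
            }
            // Update step: move each centroid to the mean of its assigned points
            for (int c = 0; c < 2; c++) {
                double sum = 0.0;
                int count = 0;
                for (int i = 0; i < points.length; i++) {
                    if (assignment[i] == c) {
                        sum += points[i];
                        count++;
                    }
                }
                if (count > 0) centroids[c] = sum / count;
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        double[] points = {1.0, 1.5, 2.0, 10.0, 11.0, 12.0};
        System.out.println(Arrays.toString(cluster(points, 10))); // [0, 0, 0, 1, 1, 1]
    }
}
```

Mahout's value is running this kind of loop over billions of points by pushing the assignment and update steps out across a Hadoop cluster.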

Of course, the devil is in the details and I've glossed over the really important part, which is that very first line:

DataModel model = new FileDataModel(new File("data.txt"));

Hey, if you could get some math geeks to do all the work and reduce all of computing down to the 10 or so lines that compose the algorithm, we'd all be out of a job. But how did that data get into the format the recommender needs? Being able to feed the algorithm properly shaped data is why developers make the big bucks, and even if Mahout doesn't need Hadoop to implement many of its machine-learning algorithms, you might need Hadoop to put the data into the three columns the simple recommender requires.
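Those three columns are FileDataModel's expected input: one preference per line, comma-separated, as userID,itemID,preference. The IDs and values below are made up for illustration:

```
1,101,5.0
1,102,3.0
2,101,2.0
2,103,4.5
```

Getting terabytes of clickstream or order history boiled down to lines like these is exactly the kind of batch job Hadoop is for.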

Mahout is a great way to leverage a range of techniques, from recommendation engines to pattern recognition to data mining. Once we as an industry get done with the big, fat Hadoop deploy, the interest in machine learning and possibly AI more generally will explode, as one insightful commentator on my Hadoop article observed. Mahout will be there to help.

This article, "Enjoy machine learning with Mahout on Hadoop," was originally published at InfoWorld.com. Keep up on the latest news in application development and read more of Andrew Oliver's Strategic Developer blog at InfoWorld.com. For the latest business technology news, follow InfoWorld.com on Twitter.
