Which Spark machine learning API should you use?

A brief introduction to Spark MLlib's APIs for basic statistics, classification, clustering, and collaborative filtering, and what they can do for you

You’re not a data scientist. If the tech and business press is to be believed, machine learning will stop global warming, except that’s apparently fake news created by China. Maybe machine learning can find fake news (a classification problem)? In fact, maybe it can.

But what can machine learning do for you? And how will you find out? If you’re already using Apache Spark for batch and stream processing, there’s a good place to start close to home. Along with Spark SQL and Spark Streaming, which you’re probably already using, Spark provides MLlib, which is, among other things, a library of machine learning and statistical algorithms in API form.

Here is a brief guide to four of the most essential MLlib APIs, what they do, and how you might use them.  

Basic statistics

Mainly you’ll use these APIs for A-B testing or A-B-C testing. Frequently in business we assume that if two averages are the same, then the two things are roughly equivalent. That isn’t necessarily true. Suppose a car manufacturer replaces the seat in a car and surveys customers on how comfortable it is. At one end, shorter customers may say the seat is much more comfortable. At the other end, taller customers may say it is so uncomfortable they wouldn’t buy the car, while the people in the middle balance out the difference. On average the new seat might be slightly more comfortable, but if no one over six feet tall buys the car anymore, we’ve failed somehow.

Spark’s hypothesis testing lets you run a Pearson chi-squared test or a Kolmogorov–Smirnov test to see how well something “fits” or whether the distribution of values is “normal.” It can be used almost anywhere you have two series of data. That “fit” might be “did you like it?” or whether the new algorithm returns “better” results than the old one. You’re just in time to enroll in the Basic Statistics course on Coursera.
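In Spark this lives in MLlib’s hypothesis-testing API (for example, `Statistics.chiSqTest` in the RDD-based API, or `ChiSquareTest` in `pyspark.ml.stat`). The arithmetic behind the Pearson statistic is simple enough to sketch in plain Python; the survey counts below are invented for illustration:

```python
# Pearson chi-squared test of independence, sketched in plain Python.
# Rows: height group of surveyed customers; columns: comfortable / not.
# All counts here are invented for illustration.

def chi_squared(observed):
    """Return the Pearson chi-squared statistic for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (obs - expected) ** 2 / expected
    return stat

survey = [[40, 10],   # shorter customers: comfortable vs. not
          [15, 35]]   # taller customers:  comfortable vs. not
print(chi_squared(survey))
```

A statistic near zero means the observed counts are about what independence would predict; a large value (as with these made-up numbers) hints that height and comfort really are related, which is exactly the problem the averages were hiding.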


Classification

What are you? Given a set of attributes, you can get the computer to sort “things” into their proper category. The trick is coming up with attributes that match the “class,” and there is no one right answer there, though there are a lot of wrong ones. If you picture someone looking through a stack of forms and sorting them into piles, that’s classification. You’ve run into it with spam filters, which use a list of words that spam usually contains. You may also be able to diagnose patients, or determine which customers are likely to cancel their broadcast cable subscription (people who don’t watch live sports). Essentially, classification “learns” to label things based on labels applied to past data and can apply those labels in the future. In Coursera’s Machine Learning Specialization there is a course specifically on this; it started on July 10, but I’m sure you can still get in.
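MLlib supplies real classifiers for this at scale (for example, `NaiveBayes` and `LogisticRegression` in `pyspark.ml.classification`). The spam-words idea itself can be sketched in a few lines of plain Python; the training messages and the add-one scoring scheme below are invented for illustration, not a production spam filter:

```python
from collections import Counter

# Toy "learn labels from past data" classifier: count which words appear
# in messages previously labeled spam or ham, then label new messages by
# comparing scores. All training data is invented.
train = [("win cash prize now", "spam"),
         ("cheap prize win win", "spam"),
         ("meeting moved to noon", "ham"),
         ("lunch at noon tomorrow", "ham")]

counts = {"spam": Counter(), "ham": Counter()}
for text, label in train:
    counts[label].update(text.split())

def classify(text):
    # Score each class by summing smoothed per-word frequencies
    # learned from the labeled examples.
    def score(label):
        c = counts[label]
        total = sum(c.values())
        return sum((c[w] + 1) / (total + 1) for w in text.split())
    return "spam" if score("spam") > score("ham") else "ham"

print(classify("win a cash prize"))
```

The point is the shape of the workflow: past data arrives already labeled, the model extracts whatever signal links attributes (here, words) to labels, and future data gets labeled automatically.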


Clustering

If “k-means clustering” is the only thing that comes out of someone’s mouth when you ask them about machine learning, you know they just read the crib sheet and don’t know anything about it. Given a set of attributes, you may find “groups” of points that seem to be pulled together as if by gravity. Those are clusters. You can “see” these clusters, but some may be close together; there may be one big one with a small one off to the side, or smaller clusters inside a big cluster. Because of these and other complexities, there are a lot of different clustering algorithms. Though different from classification, clustering is often used to sort people into groups. The big difference between clustering and classification is that in clustering we don’t know the labels (or groups) up front; in classification we do. Customer segmentation is a very common use. There are different flavors of it, such as sorting customers into credit or retention risk groups, or into buying groups (fresh produce or prepared foods), but it is also used for things like fraud detection. Here’s a course on Coursera with a lecture series specifically on clustering, and yes, it covers k-means for that next interview, but I find it slightly creepy when half the professor floats over the board (you’ll see what I mean).
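In MLlib this is the `KMeans` estimator in `pyspark.ml.clustering`, one of several clustering algorithms it ships. The heart of k-means, Lloyd’s assign-then-update loop, fits in a dozen lines; the one-dimensional toy data below, with two obvious gravity wells, is invented:

```python
import random

# One-dimensional k-means (Lloyd's algorithm) on toy data with two
# obvious "gravity wells" around 1.0 and 10.0. Data and k are made up.
points = [0.8, 1.1, 1.3, 0.9, 9.8, 10.2, 10.0, 9.9]

def kmeans(points, k, iters=10, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)   # start from k random points
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

print(kmeans(points, k=2))
```

Notice that nothing here was told there are “short” and “tall” groups; with k=2 the centroids simply settle into the two wells. Choosing k, and deciding whether k-means is even the right flavor of clustering for your data, is where the real work lives.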

Collaborative filtering

Man, collaborative filtering is a popularity contest. The company I work for uses it to improve search results; I even gave a talk on this. If enough people click on the second cat picture, it must be better than the first cat picture. In a social or e-commerce setting, you can use the likes and dislikes of various users to figure out which is the “best” result for most users, or even for specific sets of people. This can be done on multiple properties for recommender systems; you see it on Google Maps or Yelp when you search for restaurants (you can then filter by service, food, decor, good for kids, romantic, nice view, cost). There is a lecture on collaborative filtering in the Stanford Machine Learning course, which started on July 10 (but you can still get in).
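MLlib’s recommender is `ALS` (alternating least squares) in `pyspark.ml.recommendation`, which factorizes the user–item rating matrix. A cruder neighborhood flavor of the same idea, predicting a rating from the ratings of similar users, can be sketched in plain Python; the users, items, and ratings below are all invented:

```python
from math import sqrt

# Toy user-based collaborative filtering: predict how a user would rate an
# item from the ratings of similar users. All names and ratings are invented.
ratings = {
    "ann":   {"cat_pic_1": 1, "cat_pic_2": 5, "dog_pic": 4},
    "bob":   {"cat_pic_1": 2, "cat_pic_2": 5},
    "carol": {"cat_pic_1": 5, "cat_pic_2": 1, "dog_pic": 1},
}

def similarity(u, v):
    # Cosine similarity over the items both users have rated.
    shared = set(ratings[u]) & set(ratings[v])
    if not shared:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in shared)
    nu = sqrt(sum(ratings[u][i] ** 2 for i in shared))
    nv = sqrt(sum(ratings[v][i] ** 2 for i in shared))
    return dot / (nu * nv)

def predict(user, item):
    # Average the other users' ratings for the item, weighted by how
    # similar each of them is to this user.
    votes = [(similarity(user, other), r[item])
             for other, r in ratings.items()
             if other != user and item in r]
    total = sum(w for w, _ in votes)
    return sum(w * r for w, r in votes) / total if total else None

print(predict("bob", "dog_pic"))
```

Because bob’s cat-picture tastes track ann’s more than carol’s, ann’s opinion of the dog picture counts for more. ALS reaches a similar end through matrix factorization, which scales to the millions of users and items where this brute-force neighbor comparison would fall over.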

This is not all you can do (by far) but these are some of the common uses along with the algorithms to accomplish them. Within each of these broad categories are often several alternative algorithms or derivatives of algorithms. Which to pick? Well, that’s a combination of mathematical background, experimentation, and knowing the data. Remember, just because you get the algorithm to run doesn’t mean the result isn’t nonsense.

If you’re new to all of this, then the Machine Learning Foundations course on Coursera is a good place to start -- despite the creepy floating half-professor.