What do Russian trolls, Facebook, and US elections have to do with machine learning? Recommendation engines are at the heart of the central feedback loop of social networks and the user-generated content (UGC) they create. Users join the network and are recommended users and content with which to engage. Recommendation engines can be gamed because they amplify the effects of thought bubbles. The 2016 US presidential election showed how important it is to understand how recommendation engines work and the limitations and strengths they offer.
AI-based systems aren’t a panacea that only creates good things; rather, they offer a set of capabilities. It can be incredibly useful to get an appropriate product recommendation on a shopping site, but it can be equally frustrating to get recommended content that later turns out to be fake (perhaps generated by a foreign power motivated to sow discord in your country).
This chapter covers recommendation engines and natural language processing (NLP), both at a high level and at a coding level. It also gives examples of how to use frameworks, such as the Python-based recommendation engine Surprise, as well as instructions on how to build your own. Some of the topics covered include the Netflix prize, singular-value decomposition (SVD), collaborative filtering, real-world problems with recommendation engines, NLP, and production sentiment analysis using cloud APIs.
The Netflix prize wasn’t implemented in production
Before “data science” was a common term and Kaggle was around, the Netflix prize caught the world by storm. The Netflix prize was a contest created to improve the recommendation of new movies. Many of the original ideas from the contest later turned into inspiration for other companies and products. Creating a $1 million data science contest back in 2006 sparked excitement that would foreshadow the current age of AI. In 2006, ironically, the age of cloud computing also began, with the launch of Amazon EC2.
The cloud and the dawn of widespread AI have been intertwined. Netflix has also been one of the biggest users of the public cloud via Amazon Web Services. Despite all these interesting historical footnotes, the Netflix prize-winning algorithm was never implemented in production. The 2009 winners, the BellKor’s Pragmatic Chaos team, achieved a greater than 10 percent improvement, with a test RMSE of 0.8567. The team’s paper describes the solution as a linear blend of more than 100 results. A particularly relevant quote from the paper is “A lesson here is that having lots of models is useful for the incremental results needed to win competitions, but practically, excellent systems can be built with just a few well-selected models.”
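The blending idea from the winning paper can be shown with a minimal sketch. The weights and per-model predictions below are hypothetical illustrations, not values from the BellKor solution:

```python
def blend(predictions, weights):
    """Linear blend: the final rating prediction is a weighted
    sum of the predictions from several individual models."""
    return sum(w * p for w, p in zip(weights, predictions))

# Three hypothetical models rate the same movie for the same user;
# the weights would normally be fit on a holdout set.
model_predictions = [3.8, 4.1, 3.5]
blend_weights = [0.5, 0.3, 0.2]

print(blend(model_predictions, blend_weights))  # roughly 3.83
```

The production lesson quoted above applies directly: each extra model in the blend adds a small accuracy gain but a large maintenance cost.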
The winning approach for the Netflix competition was never implemented in production at Netflix because the engineering complexity was deemed too great compared with the gains it produced. A core algorithm used in recommendations, SVD, illustrates why; as noted in “Fast SVD for Large-Scale Matrices,” “though feasible for small data sets or offline processing, many modern applications involve real-time learning and/or massive data set dimensionality and size.” In practice, this is one of the huge challenges of production machine learning: the time and computational resources necessary to produce results.
I had a similar experience building recommendation engines at companies. When an algorithm is run in a batch manner and is simple, it can generate useful recommendations. But if a more complex approach is taken, or if the requirements go from batch to real time, the complexity of putting it into production and/or maintaining it explodes. The lesson here is that simpler is better: choosing batch-based machine learning over real time, choosing a simple model over an ensemble of multiple techniques, and deciding whether it makes sense to call a recommendation engine API rather than building the solution yourself.
Key concepts in recommendation systems
Figure 1 shows a social network recommendation feedback loop. The more users a system has, the more content it creates. The more content that is created, the more recommendations it creates for new content. This feedback loop, in turn, drives more users and more content. As mentioned at the beginning of this chapter, these capabilities can be used for both positive and negative features of a platform.
Figure 1: The social network recommendation feedback loop
Using the Surprise framework in Python
One way to explore the concepts behind recommendation engines is to use the Surprise framework. A few of the handy things about the framework are that it has built-in data sets—MovieLens and Jester—and it includes SVD and other common algorithms including similarity measures. It also includes tools to evaluate the performance of recommendations in the form of root mean squared error (RMSE) and mean absolute error (MAE), as well as the time it took to train the model.
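As a quick illustration of what the RMSE and MAE metrics measure, here is a minimal pure-Python sketch over a handful of predicted and actual ratings (the numbers are made up for illustration):

```python
from math import sqrt

# Predicted vs. actual ratings for five hypothetical user/movie pairs.
predicted = [3.5, 4.0, 2.0, 5.0, 3.0]
actual = [4.0, 4.0, 1.0, 4.5, 3.0]

errors = [p - a for p, a in zip(predicted, actual)]

# RMSE penalizes large errors more heavily than MAE does.
rmse = sqrt(sum(e ** 2 for e in errors) / len(errors))
mae = sum(abs(e) for e in errors) / len(errors)

print(f"RMSE: {rmse:.3f}, MAE: {mae:.3f}")
```

Surprise computes these same quantities over held-out ratings when you evaluate an algorithm.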
Here is an example of how it can be used in a pseudo production situation by tweaking one of the provided examples.
First are the necessary imports to get the library loaded:
In [2]: import io
   ...: from surprise import KNNBaseline
   ...: from surprise import Dataset
   ...: from surprise import get_dataset_dir
   ...: import pandas as pd
A helper function is created to convert IDs to names:
In [3]: def read_item_names():
   ...:     """Read the u.item file from MovieLens 100-k dataset and return two
   ...:     mappings to convert raw ids into movie names and movie names into raw ids.
   ...:     """
   ...:
   ...:     file_name = get_dataset_dir() + '/ml-100k/ml-100k/u.item'
   ...:     rid_to_name = {}
   ...:     name_to_rid = {}
   ...:     with io.open(file_name, 'r', encoding='ISO-8859-1') as f:
   ...:         for line in f:
   ...:             line = line.split('|')
   ...:             rid_to_name[line[0]] = line[1]
   ...:             name_to_rid[line[1]] = line[0]
   ...:
   ...:     return rid_to_name, name_to_rid
Similarities are computed between items:
In [4]: # First, train the algorithm
   ...: # to compute the similarities between items
   ...: data = Dataset.load_builtin('ml-100k')
   ...: trainset = data.build_full_trainset()
   ...: sim_options = {'name': 'pearson_baseline', 'user_based': False}
   ...: algo = KNNBaseline(sim_options=sim_options)
   ...: algo.fit(trainset)
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Out[4]: <surprise.prediction_algorithms.knns.KNNBaseline>
Finally, ten recommendations are provided, which are similar to another example in this chapter:
In [5]: rid_to_name, name_to_rid = read_item_names()
   ...:
   ...: # Retrieve inner id of the movie Toy Story
   ...: toy_story_raw_id = name_to_rid['Toy Story (1995)']
   ...: toy_story_inner_id = algo.trainset.to_inner_iid(toy_story_raw_id)
   ...:
   ...: # Retrieve inner ids of the nearest neighbors of Toy Story.
   ...: toy_story_neighbors = algo.get_neighbors(toy_story_inner_id, k=10)
   ...:
   ...: # Convert inner ids of the neighbors into names.
   ...: toy_story_neighbors = (algo.trainset.to_raw_iid(inner_id)
   ...:                        for inner_id in toy_story_neighbors)
   ...: toy_story_neighbors = (rid_to_name[rid]
   ...:                        for rid in toy_story_neighbors)
   ...:
   ...: print('The 10 nearest neighbors of Toy Story are:')
   ...: for movie in toy_story_neighbors:
   ...:     print(movie)
The 10 nearest neighbors of Toy Story are:
Beauty and the Beast (1991)
Raiders of the Lost Ark (1981)
That Thing You Do! (1996)
Lion King, The (1994)
Craft, The (1996)
Liar Liar (1997)
Aladdin (1992)
Cool Hand Luke (1967)
Winnie the Pooh and the Blustery Day (1968)
Indiana Jones and the Last Crusade (1989)
In exploring this example, consider the real-world issues with implementing this in production. Here is an example of a pseudocode API function that someone in your company may be asked to produce:
def recommendations(movies, rec_count):
    """Return recommendations"""

movies = ["Beauty and the Beast (1991)", "Cool Hand Luke (1967)",.. ]

print(recommendations(movies=movies, rec_count=10))
Some questions to ask in implementing this are: What trade-offs are you making in picking the top from a group of selections versus just a movie? How well will this algorithm perform on a very large data set? There are no right answers, but these are things you should think about as you deploy recommendation engines into production.
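One way the API could be fleshed out is a minimal sketch that pools the neighbors of every seed movie and ranks them by vote count. The NEIGHBORS table below is a hypothetical stand-in for a neighbor list precomputed offline (as the KNNBaseline example would produce); it is not part of any real API:

```python
from collections import Counter

# Hypothetical precomputed neighbor table: movie -> similar movies.
NEIGHBORS = {
    "Beauty and the Beast (1991)": [
        "Aladdin (1992)", "Lion King, The (1994)", "Liar Liar (1997)"],
    "Cool Hand Luke (1967)": [
        "Raiders of the Lost Ark (1981)", "Aladdin (1992)"],
}

def recommendations(movies, rec_count):
    """Return up to rec_count movies similar to any movie in the seed list."""
    votes = Counter()
    for movie in movies:
        votes.update(NEIGHBORS.get(movie, []))
    # Movies neighboring several seeds rank first; never recommend a seed.
    ranked = [m for m, _ in votes.most_common() if m not in movies]
    return ranked[:rec_count]

print(recommendations(
    ["Beauty and the Beast (1991)", "Cool Hand Luke (1967)"], rec_count=10))
```

This answers the "top from a group of selections" question by simple vote counting; a production version would also need to weight votes by similarity score and handle seeds missing from the table.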
Cloud solutions to recommendation systems
The Google Cloud Platform has an example of using machine learning on Compute Engine to make product recommendations that is worth exploring. In the example, PySpark and the ALS algorithm are used, along with proprietary Cloud SQL. Amazon also has an example of how to build a recommendation engine using its platform, Spark, and Elastic MapReduce (EMR).
In both cases, Spark is used to increase the performance of the algorithm by dividing the computation across a cluster of machines. Finally, AWS is heavily pushing SageMaker, which can do distributed Spark jobs natively or talk to an EMR cluster.
Real-world production issues with recommendations
Most books and articles on recommendation focus purely on the technical aspects of recommendation systems. This book is about pragmatism, and so there are some issues to talk about when it comes to recommendation systems. A few of these topics are covered in this section: performance, ETL, user experience (UX), and shills/bots.
One of the most popular algorithms, as discussed, runs in O(n_samples^2 * n_features) time, or quadratic time. This means that it is very difficult to train a model in real time and get an optimal solution. Therefore, training a recommendation system will need to occur as a batch job in most cases, unless tricks are used such as a greedy heuristic and/or only creating a small subset of recommendations for active users, popular products, and so on.
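The "small subset" trick can be sketched in a few lines: in a nightly batch job, only run the expensive scoring step for users who were recently active. The activity data and threshold below are hypothetical:

```python
# user -> number of sessions in the last week (hypothetical activity data)
activity = {"u1": 14, "u2": 0, "u3": 3, "u4": 1, "u5": 22}

def active_users(activity, min_sessions=2):
    """Return only the users active enough to be worth scoring."""
    return [u for u, sessions in activity.items() if sessions >= min_sessions]

def nightly_batch(activity):
    """Stand-in for the expensive model scoring, run only on the subset."""
    return {u: f"recommendations for {u}" for u in active_users(activity)}

print(sorted(nightly_batch(activity)))
```

With a quadratic algorithm, shrinking the scored population this way cuts the cost far more than linearly.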
When I created a user-follow recommendation system from scratch for a social network, I found that many of these issues came front and center. Training the model took hours, so the only realistic solution was to run it nightly. Additionally, I later created an in-memory copy of our training data, so the algorithm was CPU bound rather than I/O bound.
Performance is a nontrivial concern in creating a production recommendation system, in both the short term and the long term. The approach you use initially may not scale as your company grows its users and products. Perhaps a Jupyter Notebook, Pandas, and scikit-learn were acceptable when you had 10,000 users on your platform, but that may quickly turn out not to be a scalable solution.
Instead, a PySpark-based support vector machine training algorithm may dramatically improve performance and decrease maintenance time. Later still, you may need to switch to dedicated machine learning chips, such as TPUs or the Nvidia Volta. Having the ability to plan for this capacity while still delivering an initial working solution is a critical skill for implementing pragmatic AI solutions that actually make it to production.
Real-world recommendation problems: Integration with production APIs
I have found that many real-world problems surface in production at startups that build recommendation systems. These are problems that are not heavily discussed in machine learning books. One such problem is the cold-start problem. In the examples using the Surprise framework, there is already a massive database of “correct answers.” In the real world, you may have so few users or products that it doesn’t make sense to train a model. What can you do?
A decent solution is to make the path of the recommendation engine follow three phases. For phase one, take the most popular users, content, or products and serve those out as recommendations. As more UGC is created on the platform, for phase two, use similarity scoring (without training a model). Here is some hand-coded code I have used in production a couple of different times that did just that. First is the Tanimoto score, or Jaccard distance by another name.
"""Data Science Algorithms""" def tanimoto(list1, list2): """tanimoto coefficient
In [2]: list2=['39229', '31995', '32015'] In [3]: list1=['31936', '35989', '27489', '39229', '15468', '31993', '26478'] In [4]: tanimoto(list1,list2) Out[4]: 0.1111111111111111
Uses intersection of two sets to determine numerical score """
intersection = set(list1).intersection(set(list2)) return float(len(intersection))/(len(list1)) +\ len(list2) - len(intersection)
Next is HBD: Here Be Dragons. Follower relationships are downloaded and converted into a Pandas DataFrame.
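Before diving into that, the phase-two similarity-scoring idea can be shown end to end with a small sketch: rank candidate accounts for a target user by the Tanimoto similarity of their follow sets. The users and follow sets below are made up, and plain dicts of sets stand in for the Pandas DataFrame of follower relationships:

```python
# Hypothetical follow graph: user -> set of accounts that user follows.
follows = {
    "alice": {"nba", "espn", "wired"},
    "bob": {"nba", "espn", "cnn"},
    "carol": {"vogue", "eater"},
}

def tanimoto(set1, set2):
    """Tanimoto coefficient over two sets (set-based variant of the
    list version above)."""
    intersection = set1 & set2
    return len(intersection) / (len(set1) + len(set2) - len(intersection))

def similar_users(target, follows):
    """Rank every other user by overlap with the target's follow set."""
    scores = {user: tanimoto(follows[target], followed)
              for user, followed in follows.items() if user != target}
    return sorted(scores, key=scores.get, reverse=True)

print(similar_users("alice", follows))  # bob shares two of alice's follows
```

No model is trained here, which is exactly why this approach works in the cold-start phase: it needs only the raw follow relationships.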