Learning to Rank (LTR) is a machine learning technique used to refine search results based on signals such as actual usage patterns. LTR isn’t an algorithm unto itself; the underlying ranking model has traditionally been a support vector machine (SVM), but more recently gradient-boosted trees (GBTs) have been used instead.
There are multiple implementations of Learning to Rank available. The best-known open source implementations are XGBoost, RankLib, and the LTR module in Apache Solr, which was contributed by Bloomberg. (I work for Lucidworks, the primary sponsor of the Solr project.)
Because LTR is based on a machine learning classification model, it is a supervised learning method. This means you train a model on prepared data: results with hand-labeled ranks, which are then used to optimize future rankings.
There are some drawbacks to LTR, particularly when it is trained on clicks on search results. For example, when offered all the world’s bountiful harvest, users tend to pick the thing on the top. Meaning, if Google stopped offering anything besides the “I’m Feeling Lucky” button, user behavior would largely remain unchanged (except the bounce rate would be lower). This position bias can skew results toward what your search is already returning, defeating the point. However, you can correct for it by weighting up clicks on the second and third results (and so on).
Modern search engines are actually pretty good at relevance, so LTR is unlikely to be a whole order of magnitude better than an already-relevant search. There are also other ways to achieve good results, such as boosting the second or third result when it gets a relatively high number of clicks. Moreover, LTR, like any supervised model, takes time and resources.
How to use LTR
LTR is most useful in combination with these other methods. The actual improvement you get from LTR will vary wildly depending on what your data is like, your use case, and how good your ordering already is.
To use LTR, you identify features of your corpus: key fields, the relevance score returned by the search engine (such as TF-IDF or BM25), and an ordering by the number of clicks. Using those features, you train the model on a given set of queries. After that, LTR lets you rerank results based on which features proved most important in the training set.
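The rerank step described above can be sketched in a few lines. This is a simplified illustration, not a real LTR implementation: the feature names, documents, and model weights below are all made up, and the "model" is a plain linear scorer standing in for what would actually be an SVM or GBT trained offline.

```python
# A minimal sketch of reranking with a trained model, assuming a linear
# model whose weights were already learned offline. The features here
# (BM25 score, a title-match flag, click count) and the weights are
# illustrative, not from a real search engine.

def extract_features(doc):
    """Turn a search hit into a feature vector: the engine's relevance
    score (e.g., BM25), a key-field match flag, and historical clicks."""
    return [doc["bm25"], float(doc["title_match"]), doc["clicks"]]

def rerank(docs, weights):
    """Score each document as a weighted sum of its features and
    return the hits sorted best-first."""
    def score(doc):
        return sum(w * f for w, f in zip(weights, extract_features(doc)))
    return sorted(docs, key=score, reverse=True)

# Hypothetical trained weights for: BM25, title match, click count.
weights = [0.6, 1.5, 0.02]

hits = [
    {"id": "a", "bm25": 7.1, "title_match": False, "clicks": 3},
    {"id": "b", "bm25": 6.4, "title_match": True,  "clicks": 40},
]

# Document "b" has a slightly lower BM25 score but a title match and far
# more clicks, so the learned weights rank it first.
print([d["id"] for d in rerank(hits, weights)])  # → ['b', 'a']
```

In a real deployment, the search engine returns the top N candidates and the trained model reranks just those, since scoring the whole corpus per query would be too slow.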
Ultimately, like any machine learning technique, LTR is only as good as your model, which makes it critical to select “good” features. It may also be necessary to normalize the data feeding the model. For example, if you find that on your website the first result always gets ten times the clicks of the second result, regardless of what that first result is, then a second result that gets half as many clicks as the first probably ought to be ranked first.
Where to learn how to use LTR
You can try LTR yourself:
- There is a tutorial in Solr’s documentation.
- There is a Coursera module on Learning to Rank.
- You can grab Stanford University’s presentation on Learning to Rank, which goes into some of the algorithms used.
- If you’re looking for data to try this on, the well-worn Best Buy dataset on Kaggle is a good place to start, though you’ll have to massage the data.