In machine learning parlance, clustering, or similarity search, looks for affinities in sets of data that don't normally make such a job easy. If you wanted to compare 100 million images against each other and find the ones that look most alike, that's a clustering job. The hard part is scaling well across multiple processors, where you'd get the biggest speedup.
Facebook's AI research division (FAIR) recently unveiled, with little fanfare, a proposed solution called Faiss. It's an open source library, written in C++ and with bindings for Python, that allows massive data sets like still images or videos to be searched efficiently for similarities.
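To see what kind of work the library speeds up, here is a brute-force nearest-neighbor search written in plain NumPy. This is only a sketch of the underlying problem, not Faiss itself, and the data is made up for illustration; Faiss's contribution is doing this kind of search far faster, at far larger scale, on GPUs.

```python
import numpy as np

def knn_search(database, queries, k):
    """Brute-force k-nearest-neighbor search by squared L2 distance.
    database: (n, d) array; queries: (m, d) array.
    Returns (m, k) arrays of distances and database indices."""
    # Expand ||q - x||^2 = ||q||^2 - 2 q.x + ||x||^2 to get all
    # pairwise squared distances in one matrix product.
    d2 = (np.sum(queries ** 2, axis=1, keepdims=True)
          - 2.0 * queries @ database.T
          + np.sum(database ** 2, axis=1))
    idx = np.argsort(d2, axis=1)[:, :k]       # k closest per query
    dist = np.take_along_axis(d2, idx, axis=1)
    return dist, idx

# Toy data: queries are slightly perturbed copies of known database rows.
rng = np.random.default_rng(0)
db = rng.standard_normal((1000, 32)).astype("float32")
q = db[:5] + 0.01
dist, idx = knn_search(db, q, k=3)
```

Because each query is a near-copy of a database row, each one's top match is its own source vector. The catch is cost: this approach compares every query against every database vector, which is exactly what becomes prohibitive at 100 million images.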
It's also one of a growing class of machine learning solutions that are exploring better methods of making algorithms operate in parallel across multiple GPUs, for speed that's only available at scale.
A magnet for the needle in the haystack
FAIR described the project and its goals in a paper published at the end of last February. The problem wasn't only how to run similarity searches, or "k-selection" algorithms, on GPUs, but how to run them effectively in parallel across multiple GPUs, and how to deal with data sets that don't fit into RAM (such as terabytes of video).
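"k-selection" means keeping only the k best (closest) candidates out of a huge stream of distances. A sequential CPU version is simple to sketch with a bounded heap; the paper's hard problem is doing the equivalent in parallel on GPUs, where heaps map poorly onto the hardware. The function name and data below are illustrative, not from the paper.

```python
import heapq
import random

def k_select(distances, k):
    """Return the k smallest (distance, index) pairs from a stream.
    Uses a bounded max-heap (distances negated), so only k candidates
    are held in memory no matter how large the input is."""
    heap = []  # stores (-distance, index); heap[0] is the worst kept candidate
    for i, d in enumerate(distances):
        if len(heap) < k:
            heapq.heappush(heap, (-d, i))
        elif -heap[0][0] > d:
            # d beats the current worst of the k kept -- swap it in
            heapq.heapreplace(heap, (-d, i))
    return sorted((-nd, i) for nd, i in heap)

random.seed(1)
data = [random.random() for _ in range(10_000)]
top3 = k_select(data, 3)
```

This streaming structure is also why k-selection pairs naturally with data sets too big for RAM: distances can be consumed chunk by chunk without ever materializing the full list.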
Faiss' trick is not to search the data itself, but a compressed representation that trades a slight amount of accuracy for an order of magnitude or more of storage efficiency. Think of an MP3: though MP3 is a "lossy" compression format, it sounds good enough for most ears. In the same manner, Faiss uses an encoding called PQ (product quantization) that can be split efficiently across multiple GPUs.
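The idea behind product quantization can be sketched in a few lines: split each vector into sub-vectors, learn a small codebook per slice with k-means, then store each slice as a one-byte codebook index instead of floats. This is a toy version with illustrative parameters, not Faiss's optimized implementation:

```python
import numpy as np

def pq_train(vectors, n_sub, n_centroids=32, iters=5):
    """Learn one small k-means codebook per vector slice.
    A simplified sketch of PQ training; parameters are illustrative."""
    n, d = vectors.shape
    sub_d = d // n_sub
    rng = np.random.default_rng(0)
    codebooks = []
    for s in range(n_sub):
        x = vectors[:, s * sub_d:(s + 1) * sub_d]
        cent = x[rng.choice(n, n_centroids, replace=False)]  # init from data
        for _ in range(iters):  # crude Lloyd's k-means
            assign = np.argmin(((x[:, None, :] - cent[None]) ** 2).sum(-1), axis=1)
            for c in range(n_centroids):
                pts = x[assign == c]
                if len(pts):
                    cent[c] = pts.mean(0)
        codebooks.append(cent)
    return codebooks

def pq_encode(vectors, codebooks):
    """Replace each slice with its nearest centroid's index, compressing
    each float32 vector to n_sub single-byte codes."""
    n_sub = len(codebooks)
    sub_d = vectors.shape[1] // n_sub
    codes = np.empty((vectors.shape[0], n_sub), dtype=np.uint8)
    for s, cent in enumerate(codebooks):
        x = vectors[:, s * sub_d:(s + 1) * sub_d]
        codes[:, s] = np.argmin(((x[:, None, :] - cent[None]) ** 2).sum(-1), axis=1)
    return codes

rng = np.random.default_rng(1)
db = rng.standard_normal((2000, 32)).astype("float32")
books = pq_train(db, n_sub=4)
codes = pq_encode(db, books)  # 4 bytes per vector vs. 128 bytes raw
```

Here 32-dimensional float32 vectors (128 bytes each) shrink to 4 bytes each, a 32x reduction, which is the kind of trade that lets a 100-million-item index fit in GPU memory.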
One example search shown in the paper involves searching the Yahoo Flickr Creative Commons 100 Million data set, a library of 100 million images. Faiss was fed two images -- one of a red flower, and one of a yellow flower -- and instructed to find a chain of similar images between them. Searching all 100 million images for such similarities took 35 minutes on a set of four Nvidia Titan X GPUs.
FAIR claims Faiss is "8.5× faster than prior GPU state of the art" and provided some benchmarks to support its claim. When compared against two previous GPU k-selection algorithms, FAIR claimed, the Faiss algorithm was not only faster, but came a good deal closer to maximizing the available memory bandwidth for the GPU.
Another advantage with Faiss, said FAIR, was the total end-to-end time for the search -- the time needed to construct the PQ version of the data, plus the time needed to actually run the search. Competing solutions took days on end simply to build PQ graph data for one test; with Faiss, a "high-quality" graph can be built in "about half a day."
Pick up the pace
FAIR's strategy of slightly sacrificing accuracy is one of a variety of speedup tactics used by the latest generation of machine learning technologies.
Many of these speedups don't simply goose the performance of high-end hardware like Nvidia Titan boards, but also empower lower-end hardware, like the GPUs in smartphones. Google's deep learning system TensorFlow was recently upgraded to allow smartphone-grade GPUs to perform image-recognition work.
Another likely long-term advantage of algorithms that can efficiently trade accuracy for speed is to divide labor between a local device (fast, but not as accurate) and a remote back end (more accurate, but requires more processing power). Classifications made by a local device could be used as-is or augmented with more horsepower on the back end if there's a network connection.
The biggest takeaway with Faiss: There's still plenty of work to be done in figuring out how machine learning of all stripes can further benefit from massively parallel hardware.