Yahoo has created a deep learning system for building predictive applications such as speech and image recognition. Similar systems are already delivered by open source projects like Google's TensorFlow and Microsoft's CNTK, but Yahoo stands apart by leveraging a major force in big data processing: Spark.
Spark features an array of machine learning algorithms, and each successive release introduces new ones. But deep learning -- training a neural network on masses of data and using it to make decisions -- isn't part of its portfolio.
CaffeOnSpark addresses that gap by accepting data prepared by a Spark application and letting the resulting predictions be extracted by Spark via SQL queries or its other machine learning libraries.
The Spark and Caffe nodes can sit side by side on the same hardware, meaning the data doesn't have to be moved around as much and thus can be processed faster. Training jobs can also have their state periodically checkpointed, so a long-running job can be paused and resumed, or recovered in the event of a crash.
For the sake of familiarity, applications are launched and processing is run in CaffeOnSpark by way of the existing Spark command set. But CaffeOnSpark instances running on different nodes don't communicate with each other through Spark. Instead, they exchange data over MPI (Message Passing Interface), which can run over Ethernet or RDMA/InfiniBand to avoid network bottlenecks.
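Concretely, "the existing Spark command set" means jobs are submitted with the standard spark-submit tool. The fragment below is a sketch only: the jar name, class name, and CaffeOnSpark-specific flags are assumptions that should be checked against the project's documentation.

```shell
# Hypothetical launch of a CaffeOnSpark training job via spark-submit.
# Jar path, class name, and -train/-conf/-model flags are illustrative.
spark-submit --master yarn \
  --num-executors 4 \
  --class com.yahoo.ml.caffe.CaffeOnSpark \
  caffe-grid-with-dependencies.jar \
  -train \
  -conf lenet_solver.prototxt \
  -model hdfs:///caffe/lenet.model
```

The point is that no new job launcher is needed; a team already running Spark workloads submits a deep learning job the same way it submits any other.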
The biggest advantage of CaffeOnSpark is that it builds on an existing big data processing tool that has already achieved a great deal of user and developer momentum. Google and Microsoft tout ease of use as a chief advantage of their solutions, but familiar tools always ease the transition to a new workflow or data paradigm, especially given Spark's reputation for accessibility and simplicity.