One sure way to find the limits of a technology is to have it become popular. The explosion of interest in machine learning has exposed a long-standing shortcoming: Too much time and effort are spent shuttling data between different applications, and not enough is spent on the actual data processing.
Three providers of GPU-powered machine learning and analytics solutions are collaborating on a way for multiple programs to access the same data on a GPU and process it in place, without having to transform it, copy it, or run other performance-killing steps.
Data, stay put!
Continuum Analytics, maker of the Anaconda distribution for Python; machine learning/AI specialist H2O.ai; and GPU-powered database creator MapD (now open source) have formed a new consortium, called the GPU Open Analytics Initiative (GOAI).
Their plan, as detailed in a press release and GitHub repository, is to create a common API for storing and accessing GPU-hosted data for machine learning/AI workloads. That API, the GPU Data Frame, would keep the data on the GPU at every step of its lifecycle: ingestion, analysis, model generation, and prediction.
By keeping everything on the GPU, data doesn't bounce to or from other parts of the system, and it can be processed faster. The GPU Data Frame also provides a common, high-level option for any data processing application—not only ML/AI applications—to talk to GPU-bound data, so there's less need for any stage in the pipeline to deal with the GPU on its own.
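To make the "process it in place" idea concrete, here is a minimal sketch that uses ordinary CPU memory as a stand-in for GPU memory. It is not the GPU Data Frame API: the `SharedColumn` class and the `ingest`, `scale_in_place`, and `mean` functions are hypothetical names invented for illustration. The only real mechanism used is Python's standard `memoryview`, which lets several consumers read and write one buffer without copying it.

```python
from array import array


class SharedColumn:
    """Hypothetical stand-in for one column of a shared data frame.

    The data lives in a single buffer; consumers receive views, not copies.
    """

    def __init__(self, values):
        self._buf = array("d", values)  # one contiguous buffer of doubles

    def view(self):
        # memoryview exposes the existing buffer without copying it
        return memoryview(self._buf)


def ingest(values):
    """Ingestion stage: load the data once into the shared buffer."""
    return SharedColumn(values)


def scale_in_place(view, factor):
    """Analysis stage: modify the shared data where it already lives."""
    for i in range(len(view)):
        view[i] *= factor


def mean(view):
    """Modeling stage: read the same buffer through a second view."""
    return sum(view) / len(view)


column = ingest([1.0, 2.0, 3.0, 4.0])
analysis_view = column.view()
model_view = column.view()

scale_in_place(analysis_view, 10.0)
result = mean(model_view)  # sees the scaled values; nothing was copied
print(result)
```

Both views report the same underlying object (`analysis_view.obj is model_view.obj`), so the second stage sees the first stage's changes without any transfer or conversion. The GPU Data Frame aims for the same effect with buffers that stay in GPU memory.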
I love it when a pipeline comes together
Several projects are tackling the problem of getting disparate pieces of a machine learning pipeline to talk to each other as efficiently as possible. MIT's CSAIL and Stanford InfoLab recently collaborated on Weld, which is described as "a common runtime for data analytics." Weld generates code on the fly using the LLVM compiler framework. That allows different libraries (such as Spark or TensorFlow) to operate on the same data in place, without having to move it around or convert it.
In theory, Weld is optimized to work with CPUs and GPUs alike, but GPU support has not yet arrived. The GPU Data Frame project, by contrast, is meant to deliver that now—but only for GPUs.
Each of the companies involved in the GOAI is touting its solution's fit in an end-to-end machine learning pipeline. MapD's GPU-powered database is meant to cover the ingestion and analysis phase; H2O.ai, the model- and prediction-generation phase; and Anaconda, the use of Python at any stage in the process.
Together they constitute one approach to a general problem: how to create an end-to-end pipeline workflow for machine learning. Baidu, for instance, has hinted that it could use Kubernetes as the underpinning for such a solution.
The GOAI focuses on enabling GPUs as the underlying processing system, although the most robust solution would be a pipeline with multiple possible hardware targets: CPUs, GPUs, ASICs, FPGAs, and so on. It's possible, though, that the GOAI's work could in time become the GPU component of such a project.