Chinese search engine giant Baidu now has an open source project in the same vein: a machine learning system it claims is easier to train and use because it exposes its functions through Python libraries.
PaddlePaddle -- "Paddle" stands for "PArallel Distributed Deep LEarning" -- was developed by Baidu to augment many of its own products with deep learning.
Baidu touted PaddlePaddle's speech transcription in Chinese, either for transcribing broadcasts or as a speech-to-text system to replace keyboards in smartphones. The company claims it needed 20,000 hours of audio as training material to achieve these results with its framework.
PaddlePaddle's core libraries are written in C++ for maximum speed, employing GPU and Intel SSE/AVX acceleration where available. The user can program directly against the C++ libraries, but PaddlePaddle provides a Python library, PyDataProvider2, that removes much of the heavy lifting from the training process.
PyDataProvider2 has advantages other than the general convenience of working in Python. According to the module documentation, it "uses multithreading and a fascinating but simple cache strategy to optimize the efficiency of the data providing process." The developer uses a Python decorator (@provider) to specify a function as a data source, and PyDataProvider2 handles everything else, including parallelizing the data transfer process.
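To make the decorator idea concrete, here is a minimal pure-Python sketch of how a decorator can turn a generator function into a cached, prefetching data source. It is an illustration only, not PaddlePaddle's implementation: the `provider` decorator below, its `cache` and `prefetch` parameters, and the `process` function are all hypothetical names, and the real @provider may behave differently.

```python
import functools
import queue
import threading

_DONE = object()  # sentinel marking the end of the sample stream


def provider(cache=True, prefetch=8):
    """Hypothetical decorator (not PaddlePaddle's): wrap a generator
    function so its samples are prefetched on a background thread and,
    optionally, cached in memory after the first full pass."""
    def decorate(read_fn):
        cached = {}  # source -> list of samples from a completed pass

        @functools.wraps(read_fn)
        def data_source(source):
            # Later passes over the same source are served from memory.
            if cache and source in cached:
                yield from cached[source]
                return

            buffer = queue.Queue(maxsize=prefetch)
            errors = []

            def produce():
                # Runs on a worker thread so parsing overlaps with consumption.
                try:
                    for sample in read_fn(source):
                        buffer.put(sample)
                except Exception as exc:
                    errors.append(exc)
                finally:
                    buffer.put(_DONE)

            threading.Thread(target=produce, daemon=True).start()

            seen = []
            while True:
                sample = buffer.get()
                if sample is _DONE:
                    break
                seen.append(sample)
                yield sample
            if errors:
                raise errors[0]  # surface reader failures to the caller
            if cache:
                cached[source] = seen

        return data_source
    return decorate


@provider(cache=True)
def process(text):
    # Each line holds "label feature feature ..."; yield (features, label).
    for line in text.splitlines():
        label, *features = line.split()
        yield [float(x) for x in features], int(label)


data = "1 0.5 0.25\n0 0.1 0.9"
first_pass = list(process(data))   # parsed on the worker thread
second_pass = list(process(data))  # replayed from the in-memory cache
```

The point of the pattern is that the function the developer writes stays a plain generator, while threading and caching live entirely in the decorator.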
Training models and creating predictions are handled in a similarly high-level manner. With a convolutional neural network, for instance, the network settings are defined by Python objects. Training can be distributed across a cluster of machines, with or without GPUs. A few sample projects have been included with PaddlePaddle, including links to pertinent datasets.
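As a rough sketch of what "network settings defined by Python objects" can look like, the example below describes a small convolutional network as a list of plain Python objects and derives each layer's output shape from them. The `Conv`, `Pool`, and `Dense` classes and the `output_shapes` helper are hypothetical stand-ins written for this illustration; they are not PaddlePaddle's configuration classes.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Conv:
    filters: int
    kernel: int
    stride: int = 1
    padding: int = 0


@dataclass(frozen=True)
class Pool:
    size: int
    stride: int


@dataclass(frozen=True)
class Dense:
    units: int


Layer = Union[Conv, Pool, Dense]


def output_shapes(input_shape: Tuple[int, int, int],
                  layers: List[Layer]) -> List[Tuple[int, ...]]:
    """Return the output shape after each layer.

    input_shape is (channels, height, width). Dense layers flatten their
    input, so this sketch expects them only at the end of the stack.
    """
    shape: Tuple[int, ...] = input_shape
    shapes = []
    for layer in layers:
        if isinstance(layer, Conv):
            _, h, w = shape
            h = (h + 2 * layer.padding - layer.kernel) // layer.stride + 1
            w = (w + 2 * layer.padding - layer.kernel) // layer.stride + 1
            shape = (layer.filters, h, w)
        elif isinstance(layer, Pool):
            c, h, w = shape
            h = (h - layer.size) // layer.stride + 1
            w = (w - layer.size) // layer.stride + 1
            shape = (c, h, w)
        else:  # Dense
            shape = (layer.units,)
        shapes.append(shape)
    return shapes


# A LeNet-style stack for 28x28 single-channel images.
network = [
    Conv(filters=20, kernel=5),
    Pool(size=2, stride=2),
    Conv(filters=50, kernel=5),
    Pool(size=2, stride=2),
    Dense(units=10),
]
shapes = output_shapes((1, 28, 28), network)
```

Because the configuration is ordinary data, it can be inspected, validated, or generated programmatically before any training starts.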
Much of the model for PaddlePaddle's functions is drawn from other machine learning frameworks with Python front ends, such as Scikit-learn. Scikit-learn, however, has no native GPU support, and it doesn't appear to have native support for aggregating work across multiple nodes of a compute cluster.
Wei Xu, Distinguished Scientist at Baidu and leader of PaddlePaddle development, stated in an email that another of PaddlePaddle's goals is to create models with significantly less code. "Take machine translation model as an example," Xu wrote. "You only need to write 25 percent of code than what you would on some other popular platforms. PaddlePaddle allows you to apply existing models to new problems without worrying [about] the math equations used to implement the model."
Baidu has plans in the works for adding support for other languages when performing predictions, but Xu said there are currently no intentions to support anything other than Python for model training. "For prediction, the language needs to align with the language used by the products, so it’s important to support different languages for prediction. Training is an offline process, and we think Python alone is a good language for this purpose."
This isn't the first time Baidu has released an open source project. Earlier projects from its labs include the Baidu File System (used as the storage layer for its query system, which is powered in part by Apache Spark); Galaxy, a cluster manager; and Shuttle, a computational framework in the MapReduce mold.
PaddlePaddle stands out not only because it's the first machine learning project offered by the company, but also because it's apparently the first of Baidu's open source projects to invite participation from English-speaking developers. PaddlePaddle's documentation and examples are bilingual, while most of the previous projects have documentation primarily or entirely in Chinese.