Machine learning is exciting, but the work is complex and difficult. It typically involves a lot of manual lifting -- assembling workflows and pipelines, setting up data sources, and shunting back and forth between on-prem and cloud-deployed resources.
The more tools you have in your belt to ease that job, the better. Thankfully, Python is a giant tool belt of a language that's widely used in big data and machine learning. Here are five Python libraries that help relieve the heavy lifting for those trades.
A simple package with a powerful premise, PyWren lets you run Python-based scientific computing workloads as multiple instances of AWS Lambda functions. A profile of the project at The New Stack describes PyWren using AWS Lambda as a giant parallel processing system, tackling projects that can be sliced and diced into little tasks that don't need a lot of memory or storage to run.
One downside is that Lambda functions can't run for more than 300 seconds. But if you have a job that takes only a few minutes to complete and need to run it thousands of times across a data set, PyWren may be a good option to parallelize that work in the cloud at a scale unavailable on user hardware.
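The pattern PyWren targets, one small function mapped over many independent inputs, can be sketched locally with the standard library. The block below is a stand-in and not PyWren itself: it uses `concurrent.futures.ThreadPoolExecutor` on a single machine where PyWren would dispatch each call to a Lambda function. The `simulate_chunk` workload and the chunk sizes are hypothetical, chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def simulate_chunk(seed: int) -> int:
    """A small, self-contained task: no shared state, little memory.

    Hypothetical workload: sum of squares over 1,000 integers starting at seed.
    """
    return sum(i * i for i in range(seed, seed + 1000))


def run_all(seeds, max_workers: int = 8):
    """Map the task over every input and collect results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the same order as the inputs.
        return list(pool.map(simulate_chunk, seeds))


if __name__ == "__main__":
    # Ten independent chunks; each could run on a separate worker.
    results = run_all(range(0, 10_000, 1000))
    print(len(results), results[0])
```

The sketch only works as a model for Lambda if each call is independent of the others and finishes well inside the execution limit; a job that needs shared state or long-running steps would not fit this shape.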
Google's TensorFlow framework is taking off big-time now that it's at a full 1.0 release. One common question about it: How can I make use of the models I train in TensorFlow without using TensorFlow itself?
Tfdeploy is a partial answer to that question. It exports a trained TensorFlow model to "a simple NumPy-based callable," meaning the model can be used in Python with Tfdeploy and the NumPy math-and-stats library as the only dependencies. Most of the operations you can perform in TensorFlow can also be performed in Tfdeploy, and you can extend the behaviors of the library by way of standard Python idioms (such as overloading a class).
Now the bad news: Tfdeploy doesn't support GPU acceleration, if only because NumPy doesn't do that. Tfdeploy's creator suggests using the gNumPy project as a possible replacement.
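To make "a simple NumPy-based callable" concrete, here is a minimal sketch of the idea: a frozen model whose forward pass needs nothing but NumPy. This is not Tfdeploy's API or its export format. The `DenseModel` class, its layer layout, and the hand-written weights are all assumptions made for illustration; a real exported model would carry weights produced by training.

```python
import numpy as np


class DenseModel:
    """A frozen two-layer network evaluated with NumPy only.

    Hypothetical stand-in for an exported model: the weights would normally
    come from training; here they are fixed by hand.
    """

    def __init__(self, w1, b1, w2, b2):
        self.w1 = np.asarray(w1, dtype=float)
        self.b1 = np.asarray(b1, dtype=float)
        self.w2 = np.asarray(w2, dtype=float)
        self.b2 = np.asarray(b2, dtype=float)

    def __call__(self, x):
        x = np.asarray(x, dtype=float)
        hidden = np.maximum(x @ self.w1 + self.b1, 0.0)  # ReLU layer
        logits = hidden @ self.w2 + self.b2
        # Numerically stable softmax over the last axis.
        shifted = logits - logits.max(axis=-1, keepdims=True)
        exp = np.exp(shifted)
        return exp / exp.sum(axis=-1, keepdims=True)


model = DenseModel(
    w1=[[1.0, -1.0], [0.5, 2.0]],
    b1=[0.0, 0.1],
    w2=[[1.0, 0.0], [0.0, 1.0]],
    b2=[0.0, 0.0],
)

# Two input rows in, two rows of class probabilities out.
probs = model([[1.0, 2.0], [0.0, -1.0]])
print(probs.shape)
```

Because every step is an ordinary NumPy array operation, the model runs wherever NumPy does, which is also why it stays on the CPU.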
Writing batch jobs is generally only one part of processing heaps of data; you also have to string all the jobs together into something resembling a workflow or a pipeline. Luigi, created by Spotify and named for the other plucky plumber made famous by Nintendo, was built to "address all the plumbing typically associated with long-running batch processes."
With Luigi, a developer can take several different unrelated data processing tasks -- "a Hive query, a Hadoop job in Java, a Spark job in Scala, dumping a table from a database" -- and create a workflow that runs them, end to end. The entire description of a job and its dependencies is written as Python modules, not as XML config files or another data format, so a workflow can be integrated into other Python-centric projects.
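The idea of describing jobs and their dependencies in plain Python can be sketched with the standard library alone. The block below is not Luigi's API. The method names `requires`, `output`, and `run` echo the ones Luigi tasks define, but the `Task` base class, the `build` runner, the file-path outputs, and the two example tasks are simplified stand-ins invented here.

```python
import os
import tempfile


class Task:
    """Base class: subclasses declare dependencies, an output path, and work."""

    def __init__(self, workdir):
        self.workdir = workdir

    def requires(self):
        return []

    def output(self):
        raise NotImplementedError

    def run(self):
        raise NotImplementedError


class DumpTable(Task):
    """Hypothetical first step: write some unsorted rows to a file."""

    def output(self):
        return os.path.join(self.workdir, "table.txt")

    def run(self):
        with open(self.output(), "w") as f:
            f.write("3\n1\n2\n")


class SortTable(Task):
    """Hypothetical second step: depends on DumpTable and sorts its rows."""

    def requires(self):
        return [DumpTable(self.workdir)]

    def output(self):
        return os.path.join(self.workdir, "sorted.txt")

    def run(self):
        with open(self.requires()[0].output()) as f:
            rows = sorted(f.read().split())
        with open(self.output(), "w") as f:
            f.write("\n".join(rows) + "\n")


def build(task, log):
    """Run dependencies first; skip any task whose output already exists."""
    if os.path.exists(task.output()):
        return
    for dep in task.requires():
        build(dep, log)
    task.run()
    log.append(type(task).__name__)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        executed = []
        build(SortTable(workdir), executed)
        print(executed)
```

Treating "the output file exists" as "the task is done" is what lets a rerun pick up where a failed batch left off instead of starting from scratch.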
If you're adopting Kubernetes as an orchestration system for machine learning jobs, the last thing you want is for the mere act of using Kubernetes to create more problems than it solves. Kubelib provides a set of Pythonic interfaces to Kubernetes, originally to aid with Jenkins scripting. But it can be used without Jenkins as well, and it can do everything exposed through the kubectl CLI or the Kubernetes API.
Let's not forget about this recent and high-profile addition to the Python world, an implementation of the Torch machine learning framework. PyTorch not only ports Torch to Python but also adds many other conveniences, such as GPU acceleration and a library that allows multiprocessing to be done with shared memory (for partitioning jobs across multiple cores). Best of all, it can provide GPU-powered replacements for some of the unaccelerated functions in NumPy.