4 no-bull takeaways about Google's machine learning project

TensorFlow, originally designed to support Google's systems at scale, is now available for everyone to use. Here's why it matters

It's easy to look at TensorFlow, Google's freshly open-sourced machine learning project that aims to smarten our apps, and make headline-grabbing predictions that border on creepy.

But there's much more to the project, which is part of a growing ecosystem of open source machine learning systems fed by data at scale.

Here are four reasons why TensorFlow is worthy of attention.

1. It's the next generation of Google's in-house machine learning system

As outlined in Google CEO Sundar Pichai's blog post, TensorFlow was built for the same reason as many other open source solutions released by Google: to solve Google's internal machine learning problems at scale.

Another post, co-authored by senior Google fellow Jeff Dean (of BigTable and MapReduce fame), detailed how Google's earlier deep-learning system, DistBelief, ran into various limitations. Aside from being too tightly coupled to Google's internal infrastructure, it also dealt solely with neural networks. Dean further explained in a YouTube video how DistBelief was great for scalability and production training, but not as flexible for research.

TensorFlow, by contrast, can work with any gradient-based machine learning algorithm, which opens up a much broader range of uses. Written in C++ for speed, it doesn't require the developer to know anything about the underlying hardware. It also runs across multiple devices and architectures, so it's intended to scale from SoC devices, like phones, all the way up to distributed systems using dozens of GPUs.
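That generality is worth making concrete. A gradient-based optimizer needs only a gradient function, so the same optimization loop works for any differentiable model or loss. The sketch below is a minimal pure-Python illustration of that idea; it is not TensorFlow code, and the function names are invented for this example.

```python
# Illustrative sketch (not TensorFlow's API): a gradient-based optimizer
# is generic over any differentiable loss -- it only needs the gradient.

def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Minimize a function given only its gradient function."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)  # step downhill along the gradient
    return x

# Example: minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
minimum = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

Swap in the gradient of a neural network's loss, a logistic regression, or any other differentiable objective, and the same loop applies -- which is the flexibility DistBelief, limited to neural networks, lacked.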

Given how quickly hardware evolves and how much abstraction already exists between even a language like C++ and the iron it runs on, this makes sense. It's a forward-thinking strategy that allows Google to build advanced hardware that is cost-effective at scale, and have its work run well on all of it.

TensorFlow is the latest example of Google releasing some portion of its infrastructure for public use. Previously, its big release in this vein was container-orchestration tool Kubernetes, now widely regarded as a key part of the container ecosystem.

2. Having Google behind it means a lot

Open source projects with the backing of a major company like Google, especially for concepts as sophisticated as machine learning, have a far smaller chance of drying up or losing development momentum.

That kind of support tends to mean the project already enjoys some degree of internal use -- as Google claims with TensorFlow. The largest and most obvious bugs get worked out with an internally used open source project. Plus, with a company of Google's size, the project is probably already in use in a broad range of scenarios.

Open-sourcing the project means a greater number of users can contribute back to it. Few contributors are likely to be as large as Google, but they may still have use cases that Google never dreamed of.

3. It's easy to use -- and that matters

The main barrier to using any framework for math, statistics, or machine learning is ease of use. What drew many people to Apache Spark was not only the fast in-memory processing it provided, but the relatively simple programming interface it used. IBM rewrote one of its major data-processing products, DataWorks, around Spark, claiming this cut 40 million lines of code down to 5 million.

Likewise, one of TensorFlow's proffered advantages is ease of use. In addition to being accessible from other C++ applications, it sports interfaces to Python -- including support for IPython/Jupyter notebooks -- which is as intuitive and accessible as this sort of thing gets. Other language front ends are in the works, including Google's Go, and Python 3 support is one of the first issues flagged for a fix. Not all the pieces are there yet, but enough of the ones that matter are.
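Part of what makes that Python front end approachable is the dataflow-graph style it exposes: you describe a computation once as a graph of operations, then run it with different inputs. The toy below is a purely hypothetical sketch of that style in plain Python -- it is not TensorFlow's actual API, and every name in it is invented for illustration.

```python
# Hypothetical sketch, NOT TensorFlow's real API: the deferred
# dataflow-graph style its Python front end exposes. Operations build
# a graph of nodes; evaluation happens later, when the graph is run.

class Node:
    def __init__(self, fn, inputs=()):
        self.fn, self.inputs = fn, inputs

    def run(self, feed):
        # Evaluate this node, recursively evaluating its inputs first.
        if self in feed:
            return feed[self]
        return self.fn(*(n.run(feed) for n in self.inputs))

def placeholder():
    # A node whose value is supplied at run time via `feed`.
    return Node(lambda: None)

def add(a, b):
    return Node(lambda x, y: x + y, (a, b))

def mul(a, b):
    return Node(lambda x, y: x * y, (a, b))

# Build the graph y = a * b + a once, then run it with concrete inputs.
a, b = placeholder(), placeholder()
y = add(mul(a, b), a)
result = y.run({a: 2.0, b: 3.0})
```

The same graph can be re-run with different feeds, which is what lets a framework like this compile the description once and execute it on a phone's SoC or a rack of GPUs alike.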

4. It applies pressure to make machine learning open source by default

Google's Matt Cutts made this point, citing how "entire cottage industries" like Hadoop sprang up around reconstructing the work done in Google's paper on MapReduce. "But the results still suffered from a telephone-like effect," he wrote, "as outside code ran into issues that may have already been resolved within Google."

The advantages to keeping algorithm code close to the chest are dwindling anyway. Algorithms aren't the most important component in machine learning now -- especially when it comes to machine learning as deployed in the cloud. Instead, the data those algorithms are trained on and the connections to real-world data sources are the prime movers. Consider IBM's recent purchase of the Weather Company, which includes not only a flood of real-time, real-world data, but the sensor array used to generate it.

The algorithms and frameworks used to process such data work best when they're passed through the greatest number of hands.

Copyright © 2015 IDG Communications, Inc.