4 machine learning breakthroughs from Google's TPU processor

Google has revealed details about how its custom Tensor Processing Unit speeds up machine learning; here's how the field is set to evolve in its wake

Brian Rinker (Creative Commons BY or BY-SA)

Google is nothing if not ambitious about its machine learning plans. Around this time last year, it unveiled its custom Tensor Processing Unit (TPU), a hardware accelerator designed to run its TensorFlow machine learning framework at world-beating speeds.

Now, the company is providing details of exactly how much juice a TPU can provide for machine learning, courtesy of a paper that delves into the technical aspects. The info shows how Google's approach will influence future development of machine learning powered by custom silicon.

1. Google's TPUs, like GPUs, address division of machine learning labor

Machine learning generally happens in a few phases. First you gather data, then you train a model with that data, and eventually you make predictions with that model. The first phase doesn't typically require specialized hardware. Phase two is where GPUs come into play; in theory you can use a GPU for phase three as well.
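To make the three phases concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the data points are invented, the model is a one-variable linear fit, and the function names (`gather_data`, `train`, `predict`) are hypothetical rather than part of TensorFlow or any Google tooling.

```python
# Illustrative sketch only: gather -> train -> predict with a tiny linear model.
# The data and all function names are hypothetical.

def gather_data():
    # Phase 1: collect (input, target) pairs. These invented points sit near y = 2x + 1.
    return [(0.0, 1.1), (1.0, 2.9), (2.0, 5.2), (3.0, 6.8), (4.0, 9.1)]


def train(data, steps=2000, learning_rate=0.01):
    # Phase 2: fit slope and intercept by gradient descent on mean squared error.
    # This floating-point-heavy loop is the kind of work GPUs speed up.
    slope, intercept = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        grad_slope = sum(2 * (slope * x + intercept - y) * x for x, y in data) / n
        grad_intercept = sum(2 * (slope * x + intercept - y) for x, y in data) / n
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept


def predict(model, x):
    # Phase 3: apply the already-trained model to a new input.
    slope, intercept = model
    return slope * x + intercept


data = gather_data()
model = train(data)
print("prediction for x = 5:", round(predict(model, 5.0), 2))
```

Note how lopsided the work is: training loops over the data thousands of times, while a prediction is a single multiply and add. That asymmetry is one reason the two phases can benefit from different hardware.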

With Google's TPU, phase three is handled by an ASIC, which is a custom piece of silicon designed to do one specific job. Google's ASIC is built for integer calculations, which are all that's needed when making predictions from models, while GPUs are better at floating-point math, which is vital when training models. The idea is to have specialized silicon for each aspect of the machine learning process, so each specific step can go as fast as possible.
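The integer-versus-floating-point split can be sketched in a few lines of plain Python. The snippet below assumes a simple symmetric 8-bit quantization scheme chosen for illustration; it is not taken from Google's paper, and the helper names (`quantize`, `int_dot`) and sample values are hypothetical.

```python
# Illustrative sketch only: a dot product computed with 8-bit integers.
# The quantization scheme and all names are assumptions for illustration.

def quantize(values, bits=8):
    """Map floats to signed integers plus one shared scale factor."""
    qmax = 2 ** (bits - 1) - 1              # 127 for 8 bits
    largest = max(abs(v) for v in values)
    scale = largest / qmax if largest else 1.0
    return [round(v / scale) for v in values], scale


def int_dot(weights_q, inputs_q):
    """Integer-only multiply-accumulate, the core operation at prediction time."""
    return sum(w * x for w, x in zip(weights_q, inputs_q))


weights = [0.42, -1.30, 0.07, 0.88]         # pretend these came from float training
inputs = [1.5, 0.25, -2.0, 0.6]

w_q, w_scale = quantize(weights)
x_q, x_scale = quantize(inputs)

exact = sum(w * x for w, x in zip(weights, inputs))
approx = int_dot(w_q, x_q) * w_scale * x_scale   # rescale once at the end

print(f"float result:     {exact:.4f}")
print(f"quantized result: {approx:.4f}")
```

The two results differ slightly, which is the trade-off: predictions usually tolerate a small loss of precision, and integer multiply-accumulate units are cheaper to build in silicon than floating-point ones.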

Strictly speaking, this isn't a new approach. It's an extension of the pattern developed when GPUs were brought into the mix to speed up training. Google demonstrates a method to take the next steps with that paradigm, especially as hardware becomes more flexible and redefinable.

2. Google's TPU hardware is secret -- for now

For an operation with Google's scale and finances, custom hardware provides three advantages: It's faster, it solves the right problem at the right level, and it provides a competitive edge the company can share -- albeit on its own terms.

Right now, Google is using this custom TPU hardware to accelerate its internal systems. The feature isn't yet available through any of its cloud services. And don't expect to be able to buy the ASICs and deploy them in your boxes.

The reasons are straightforward enough. Reason one: Anything that provides Google with a distinct competitive advantage is going to be kept as close to the vest as possible. TPUs allow machine learning models to run orders of magnitude faster and more efficiently, so why give away or even sell the secret sauce?

Reason two: Google offers items to the public only after they've been given a rigorous internal shakedown. Kubernetes and TensorFlow, both of which Google had used extensively inside the company (though in somewhat different forms), took years to become publicly available.

If anything from the TPU efforts makes it to public use, it'll be through the rent-in-the-cloud model -- and odds are it'll be a generation behind whatever the company is working on internally.

3. Google's custom-silicon approach isn't the only one

Google elected to create its own ASICs, but there's another possible approach to custom silicon for running machine learning models: FPGAs, processors that can be reprogrammed on the fly.

FPGAs can perform math at high speed and with high levels of parallelism, both of which machine learning needs at almost any stage of its execution. FPGAs are also cheaper and faster to work with than ASICs out of the box, since ASICs have to be custom-manufactured to a spec.

Microsoft has twigged to the possibilities provided by FPGAs and unveiled server designs that employ them. Machine learning acceleration is one of the many duties that hardware could take on.

That said, FPGAs aren't a one-to-one replacement for ASICs, and they can't be dropped into a machine learning pipeline as-is. Also, there aren't as many programming tools for FPGAs in a machine learning context as there are for GPUs.

It's likely that the best steps in this direction won't be toolkits that enable machine learning FPGA programming specifically, but general frameworks that can perform code generation for FPGAs, GPUs, CPUs, or custom silicon alike. Such frameworks would have more to work on if Google offers its TPUs as a cloud resource, but there are already plenty of targets they can start addressing right away.

4. We've barely scratched the surface with custom machine learning silicon

Google claims in its paper that the speedups possible with its ASIC could be further bolstered by using GPU-grade memory and memory systems, with results anywhere from 30 to 200 times faster than a conventional CPU/GPU mix. That's without addressing what could be achieved by, say, melding CPUs with FPGAs, or any of the other tricks being hatched outside of Google.

It ought to be clear by now that custom silicon for machine learning will drive the development of both the hardware and software sides of the equation. It's also clear Google and others have barely begun exploring what's possible.