

Tensor Processing Units were purpose-built for machine learning: Pros, cons

Google's Tensor Processing Units are built to train and run machine learning models. Experts discuss their plusses and minuses compared to CPUs and GPUs.

As machine learning becomes a staple technology for more IT enterprises, IT executives need more compute power to handle the workloads.

Tech companies are racing to deliver the capabilities they will need.

That includes Google, which enters the market with its Tensor Processing Units (TPUs) available on Google Cloud.

According to Google, its Tensor Processing Units deliver significantly higher performance and higher performance per watt than the central processing units (CPUs) and graphics processing units (GPUs) that most organizations currently use for their machine learning workloads.

The TPU chip is an application-specific integrated circuit that Google developed specifically to handle machine learning workloads.

"It's one of the few chips out there that's designed from the ground up to run machine learning algorithms," said Mark Hung, a research vice president at Gartner Inc.

First- and second-generation Tensor Processing Units

Google said its own requirements drove the development of TPUs -- both the company's earlier first-generation TPU and the second-generation TPU that was announced in May 2017.

"While our first TPU was designed to run machine learning models quickly and efficiently -- to translate a set of sentences or choose the next move in [the board game] Go -- those models still had to be trained separately. Training a machine learning model is even more difficult than running it, and days or weeks of computation on the best available CPUs and GPUs are commonly required to reach state-of-the-art levels of accuracy," Google stated in a May 17, 2017, blog.

Although its research and engineering teams have made "great progress" in scaling the difficult task of training machine learning models using readily available hardware, the blog post continued, the first-generation TPU "wasn't enough to meet our machine learning needs." Google's new machine learning system was built to eliminate bottlenecks and maximize overall performance, using second-generation TPUs to both train and run machine learning models, the company said.


Google did not respond to an interview request to speak in more detail about its product.

Based on Google's announced speeds, CIOs who develop and run machine learning applications on Google's Cloud TPUs should find those applications more productive both during development and at run time, Hung said.

TPUs vs. GPUs and CPUs

Currently, many organizations are running their machine learning workloads on CPUs or GPUs.

But experts pointed out that TPUs differ from general-purpose CPUs and from GPUs, which were built primarily for 3D graphics, in that Google designed TPUs specifically for machine learning applications. More specifically, experts said that TPUs are built to handle both training and inference, and to do so faster than other technologies.


"The general purpose platforms don't provide the performance needed for AI's huge demand for compute power," said Hadi Esmaeilzadeh, an assistant professor in the School of Computer Science at the Georgia Tech College of Computing.

The race is on, however, to meet the need. Indeed, Google isn't the only vendor offering a new generation of chips to handle machine learning and artificial intelligence (AI), said Qirong Ho, vice president of engineering at Petuum Inc., based in Pittsburgh, Penn. "All the chip makers are making plays in this."

ARM Ltd., Advanced Micro Devices Inc., Intel, Nvidia, Qualcomm Technologies and multiple startups all have hardware for machine learning and artificial intelligence workloads, Ho said.

Experts said it's too early for organizations to compare products and weigh their advantages and disadvantages.


However, Ho noted that because Tensor Processing Units are specifically designed for machine learning, IT organizations may not be able to use them for other workloads, thereby limiting the flexibility they have to shift work around.

"It's a highly specialized device," Ho said, pointing out that GPUs can handle both machine learning and non-machine-learning workloads.

Experts also noted that IT executives are still in the dark on cost and ROI, as pricing and cost per workload for Google's Cloud TPU haven't yet been announced. They added that IT executives are also still waiting to learn how much other compute options will cost, leaving them unable to make price comparisons yet.

Additionally, any organization that wants or needs to run its machine learning tasks on premises won't be able to use Google's TPUs, as Google is offering them only via its cloud.

Moreover, Esmaeilzadeh said the machine learning and AI space is moving so fast that it's too early for anyone to bank on any one technology.

"It's not just about what is running faster now," he said, "but what are the algorithmic breakthroughs that will happen in upcoming years."
