Google has unveiled the technical details and purpose of its TPU
Although Google's TPU (Tensor Processing Unit) has been powering the company's vast empire of "deep learning" systems since 2015, very little has been known about this specialized processor. Recently, however, the web giant published a description of the chip and explained why it is an order of magnitude faster and more energy efficient than the CPUs and GPUs it replaces.
First, a little context. The TPU is a specialized ASIC developed by Google engineers to accelerate neural network inference, that is, the production phase in which already trained networks deliver results. It does this work every time a user initiates a voice search, requests a text translation, or looks for a matching image. For training, however, Google uses GPUs, just like every other company working with "deep learning" technology.
The distinction matters because inference can be done mostly with 8-bit integer operations, while training is usually done with 32-bit or, at best, 16-bit floating-point operations. As Google pointed out in its TPU analysis, multiplying 8-bit integers uses about six times less energy than multiplying 16-bit floating-point numbers, and for addition the difference is about thirteen times.
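To make the idea concrete, here is a minimal sketch of symmetric 8-bit quantization in Python. It illustrates the general principle behind low-precision inference, not the format the TPU actually uses internally.

```python
# Minimal sketch of symmetric linear quantization to 8-bit integers.
# Illustrative only - not a description of the TPU's internal scheme.
import numpy as np

def quantize_int8(x: np.ndarray):
    """Map float values to int8 using a single scale factor."""
    scale = np.max(np.abs(x)) / 127.0                      # largest magnitude maps to +/-127
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def int8_matmul(a_q, a_scale, b_q, b_scale):
    """Multiply quantized matrices; accumulate in int32, rescale back to float."""
    acc = a_q.astype(np.int32) @ b_q.astype(np.int32)      # wide accumulator avoids overflow
    return acc * (a_scale * b_scale)

# Toy usage: the quantized result stays close to the float32 reference.
w = np.random.randn(256, 256).astype(np.float32)
x = np.random.randn(256, 1).astype(np.float32)
w_q, w_s = quantize_int8(w)
x_q, x_s = quantize_int8(x)
print(np.max(np.abs(int8_matmul(w_q, w_s, x_q, x_s) - w @ x)))  # small error
```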
The TPU ASIC exploits this by incorporating an 8-bit matrix multiply unit that can perform 64K (65,536) multiply-accumulate operations in parallel, delivering 92 trillion operations per second at peak. The processor also carries 24 MB of on-chip memory, a fairly large amount for a chip of this size, although its off-chip memory bandwidth is quite modest at 34 GB/s. To keep power in check, the TPU runs at a fairly modest 700 MHz and typically draws about 40 watts; the ASIC is manufactured on a 28-nanometer process and has a TDP of 75 watts.
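The quoted peak figure follows directly from those numbers. Assuming the 256 x 256 multiply-accumulate array described in Google's paper (which is where the 64K parallel operations come from), a quick back-of-the-envelope check:

```python
# Back-of-the-envelope check of the quoted peak throughput.
# Assumes a 256 x 256 multiply-accumulate array, per Google's TPU paper.
macs = 256 * 256            # 65,536 parallel multiply-accumulate units
ops_per_mac = 2             # each MAC counts as one multiply plus one add
clock_hz = 700e6            # 700 MHz clock
peak_ops = macs * ops_per_mac * clock_hz
print(f"{peak_ops / 1e12:.1f} TOPS")   # ~91.8 trillion ops/s, i.e. the quoted ~92
```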
When it comes to computing hardware, Google focuses heavily on energy consumption, since it makes up a considerable part of the total cost of ownership (TCO) of data-center equipment. And in large data centers, electricity costs can balloon when the hardware is more powerful than the tasks require. As the authors of Google's TPU analysis put it, "when you buy equipment in the thousands, cost-performance matters more than performance itself."
Another important aspect of the TPU's design is response time. Since inference runs in response to user requests, the system must return results as quickly as possible, so the designers favored low latency over high throughput. For GPUs the trade-off is reversed, which is why they are used in the training phase, where raw computing power matters most.
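A toy model makes the trade-off visible: batching requests amortizes fixed overhead and raises throughput, but every request then waits for its whole batch. The numbers below are purely illustrative, not measurements of any real accelerator.

```python
# Toy model of the latency / throughput trade-off behind this design choice.
# All figures are made up for illustration.
FIXED_OVERHEAD_MS = 5.0      # per-batch launch and data-movement cost
PER_REQUEST_MS = 0.5         # marginal compute per request within a batch

for batch in (1, 8, 64):
    batch_time = FIXED_OVERHEAD_MS + PER_REQUEST_MS * batch
    latency = batch_time                       # a request waits for the whole batch
    throughput = batch / (batch_time / 1000)   # requests served per second
    print(f"batch={batch:3d}  latency={latency:5.1f} ms  throughput={throughput:7.0f} req/s")
```

Larger batches push throughput up while per-request latency climbs, which is why an accelerator serving interactive queries leans toward small batches and low latency.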
Google recognized the case for a dedicated inference chip about six years ago, when it began introducing "deep learning" technology into its search products. With these products used daily by millions of people, the required computing power started to look intimidating: by one internal estimate, if people used neural-network-based voice search for just three minutes a day, Google would have to double the number of its data centers, assuming conventional hardware.
Because TPUs are designed specifically for inference, they deliver much better performance and energy efficiency on that workload than Intel processors or NVIDIA GPUs. To quantify the TPU's capabilities, Google compared it with other 2015-era processors used for inference, namely the Intel Haswell Xeon and the NVIDIA K80 GPU. The tests covered six benchmarks spanning the three most commonly used types of neural networks: convolutional networks (CNN), recurrent networks (RNN), and multilayer perceptrons (MLP). The relevant configurations and results are shown in the table below.
The result: the TPU ran 15 to 30 times faster than the K80 GPU and the Haswell processor. Energy efficiency was even more impressive, with the TPU 30 to 80 times ahead of its competitors. Google estimates that if the TPU had used higher-bandwidth GDDR5 memory, its performance could have roughly tripled.
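The bandwidth claim can be sanity-checked with a rough roofline argument: at 34 GB/s, a workload needs a very high ratio of operations per byte fetched before the 92-TOPS array stops being memory-bound, so faster memory directly lifts bandwidth-limited benchmarks. The GDDR5-class figure used below is an illustrative assumption, not a published spec.

```python
# Rough roofline reasoning behind the "faster memory would help" claim.
# The ridge point is the operational intensity (ops per byte fetched)
# above which the chip is compute-bound rather than memory-bound.
peak_ops = 92e12        # 92 TOPS peak compute (published figure)
ddr3_bw = 34e9          # 34 GB/s off-chip bandwidth (published figure)
gddr5_bw = 180e9        # illustrative GDDR5-class bandwidth (assumption)

print(f"DDR3 ridge point:  ~{peak_ops / ddr3_bw:.0f} ops per byte")   # ~2700
print(f"GDDR5 ridge point: ~{peak_ops / gddr5_bw:.0f} ops per byte")  # ~500
# A lower ridge point means more workloads can actually approach peak throughput.
```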
These results are not so surprising, considering that the K80 GPU targets HPC and neural network training and is not optimized for inference. As for the Xeon processors, they are not optimized for "deep learning" algorithms of any kind, although in these scenarios they are only slightly slower than the K80.
To some extent, all of this is old news. NVIDIA's newer Pascal family of processors outperforms the K80 by a wide margin, and for inference NVIDIA now offers the Tesla P4 and P40 GPUs which, like the TPU, support 8-bit integer operations. These NVIDIA processors may not be fast enough to outperform a dedicated TPU, but the performance gap is likely to be considerably smaller.
In any case, the TPU does not threaten NVIDIA's leadership in "deep learning." The GPU maker still dominates the field and will evidently sell plenty of its P4 and P40 inference accelerators to large data centers. The broader threat to NVIDIA on the inference side is Intel, which is positioning its Altera FPGAs for this kind of work: Microsoft has already committed to Altera FPGAs, deploying what it calls the world's largest AI cloud on Altera/Intel silicon, and other AI service providers may well follow suit.
Almost certainly, Google is already working on a second-generation TPU. That chip is likely to get higher-bandwidth memory, either GDDR5 or something more exotic, and Google's engineers will probably rework the TPU's logic and layout to raise the clock speed; moving to a smaller manufacturing process, say 14 nanometers, would make those goals easier to reach. It is entirely possible that such TPUs have already been built and are quietly running in some corner of the Google cloud, but if we find out about it, it will only be a couple of years from now.