GPipe – Training Giant Neural Nets using Pipeline Parallelism

In recent years the size of machine learning datasets and models has been constantly increasing, allowing for improved results on a wide range of tasks. At the same time hardware acceleration (GPUs, TPUs) has also been improving but at a significantly slower pace. The gap between model growth and hardware improvement has increased the importance of parallelism, i.e. training a single machine learning model on multiple hardware devices. Some ML architectures, especially small models, are conducive to parallelism and can be divided quite easily between hardware devices, but in large models synchronization costs lead to degraded performance, preventing them from being used.

A new paper from Google Brain, GPipe, presents a novel model parallelism technique that allows large models to be trained on multiple hardware devices with an almost 1:1 improvement in performance (the paper reports 3.5x the processing power on 4x the hardware). The GPipe library, which will be open sourced, automatically analyzes the structure of a TensorFlow neural network model and distributes the training data and model across multiple hardware devices, while applying a unique backpropagation optimization technique.

GPipe helps ML models include significantly more parameters, allowing for better results in training. To demonstrate the effectiveness of the technique, the Google team created a larger version of AmoebaNet, a neural network architecture presented in February 2018, with larger images (480×480) as input, and achieved state-of-the-art (SOTA) results on ImageNet, while demonstrating powerful transfer learning capabilities on CIFAR-10, CIFAR-100, and additional Computer Vision metrics.

Background

Parallelism in machine learning is commonly divided into two categories:

  • Model Parallelism – When using model parallelism in training, the machine learning model is divided across K hardware devices, with each device holding a part of the model. A naive approach to model parallelism is dividing an N-layered neural network across K devices by simply hosting N/K layers (a “stage”) on each device. More sophisticated methods analyze the computational complexity of each layer to make sure that each device handles a similar share of the computation. Standard model parallelism makes it possible to train larger neural networks but suffers a large hit in performance, since devices are constantly waiting for each other and only one can perform calculations at a given time.
  • Data Parallelism – In data parallelism, the machine learning model is replicated across K hardware devices and a mini-batch of training samples is divided into K micro-batches. Each device performs the forward and backward pass for its micro-batch and, when it finishes, synchronizes with the other devices so that the updated weights reflect the entire mini-batch. At the end of each mini-batch calculation, the weights of the K models are all in sync (identical values).
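The two splitting schemes can be sketched in a few lines of Python. This is a toy illustration and not GPipe code: the helper names (`split_evenly`, `grad_mse`, `data_parallel_grad`) and the one-parameter model y = w * x are assumptions made for the example. The same even split serves for layers (naive model parallelism) and for samples (data parallelism), and averaging the per-device gradients, weighted by shard size, reproduces the full mini-batch gradient.

```python
def split_evenly(items, num_parts):
    """Split a list into num_parts contiguous chunks of near-equal size."""
    base, extra = divmod(len(items), num_parts)
    parts, start = [], 0
    for p in range(num_parts):
        size = base + (1 if p < extra else 0)  # first `extra` chunks get one more
        parts.append(items[start:start + size])
        start += size
    return parts


def grad_mse(w, samples):
    """Gradient of the mean squared error of the toy model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in samples) / len(samples)


def data_parallel_grad(w, batch, num_devices):
    """Each simulated device computes a gradient on its own shard; the
    shard gradients are then combined, weighted by shard size."""
    shards = [s for s in split_evenly(batch, num_devices) if s]
    return sum(grad_mse(w, s) * len(s) for s in shards) / len(batch)
```

For instance, `split_evenly(list(range(8)), 4)` assigns two consecutive layers to each of four devices, and `data_parallel_grad(w, batch, K)` agrees with `grad_mse(w, batch)` up to floating-point rounding.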

While micro-batches allow parallelism, under model parallelism each stage still has to wait for the results of the previous stages, resulting in a “bubble” of idleness, as shown in the accompanying image. In the image, F0,i-F3,i are the computations of a single micro-batch across the four stages and Fi,0-Fi,3 are the four micro-batch computations performed in a single stage.

Image: Model + Data Parallelism (GPipe: Huang et al. 2018)
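The size of the bubble can be estimated with a toy schedule. The sketch below is a simplification and not taken from the GPipe code: it assumes every stage needs exactly one time step per micro-batch, ignores communication, and only models the forward pass. The function names are hypothetical. Stage k can start micro-batch m only at step k + m, so the whole forward pass takes K + M - 1 steps for K stages and M micro-batches.

```python
def forward_schedule(num_stages, num_micro_batches):
    """Return a grid[stage][step] holding the micro-batch index processed
    at that step, or None when the stage is idle."""
    steps = num_stages + num_micro_batches - 1
    grid = [[None] * steps for _ in range(num_stages)]
    for k in range(num_stages):
        for m in range(num_micro_batches):
            # Stage k waits for stage k-1 to finish micro-batch m first.
            grid[k][k + m] = m
    return grid


def bubble_fraction(grid):
    """Share of (stage, step) slots in which a device sits idle."""
    cells = sum(len(row) for row in grid)
    idle = sum(cell is None for row in grid for cell in row)
    return idle / cells
```

Under these assumptions the idle share works out to (K - 1) / (M + K - 1). With four stages and four micro-batches, 12 of 28 slots are idle (3/7); with a single micro-batch, which corresponds to standard model parallelism, 12 of 16 slots are idle (3/4). Splitting the mini-batch into more micro-batches shrinks the bubble.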

Data parallelism can be effective despite the bubble problem but suffers from an additional problem – communication overhead. As models grow and hardware becomes faster, the requirement to sync the entire model between devices becomes a bottleneck in the training process, considerably slowing it down. The accompanying image shows how, in large neural networks, communication overhead accounts for a large majority of the training time. Communication overhead encourages the use of very large mini-batches, but these are often the wrong choice for training a network and can lead to inferior results in production.

Image: PipeDream: Fast and Efficient Pipeline Parallel DNN Training

How GPipe works

GPipe uses both model and data parallelism, a combination commonly known as ‘pipelining’. It makes two key contributions beyond previous pipelining techniques – automatic parallelism and device memory optimization.

Automatic Parallelism

The software receives as input an architecture of a neural network, mini-batch size, and the number of hardware devices that will be available for the calculation. It then automatically divides the network layers into stages and the mini-batches into micro-batches, spreading them across the devices. To divide the model into K stages, GPipe estimates the cost of each layer given its activation function and the content of the training data. While the paper doesn’t detail how this is done in GPipe, a common technique is to run samples of the data through the neural network, measure the computation time of each layer, and divide accordingly. GPipe does receive as input an optional cost estimation function for each layer, allowing more sophisticated techniques to improve on its internal mechanism.
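Since the paper does not describe the partitioning algorithm, the following is only a sketch of one reasonable approach, not GPipe's actual method. It assumes a list of per-layer costs is already available (for example, measured computation times) and uses dynamic programming to split the layers into K contiguous stages so that the most expensive stage is as cheap as possible. The function name `partition_layers` is hypothetical.

```python
from itertools import accumulate


def partition_layers(costs, num_stages):
    """Split layers into contiguous stages, minimizing the cost of the
    slowest stage. Returns a list of (start, end) index pairs, end exclusive."""
    n = len(costs)
    if not 1 <= num_stages <= n:
        raise ValueError("need between 1 and len(costs) stages")
    prefix = [0] + list(accumulate(costs))  # prefix[i] = cost of layers [0, i)
    inf = float("inf")
    # best[k][i]: smallest bottleneck when the first i layers form k stages.
    best = [[inf] * (n + 1) for _ in range(num_stages + 1)]
    cut = [[0] * (n + 1) for _ in range(num_stages + 1)]
    best[0][0] = 0
    for k in range(1, num_stages + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):  # the last stage covers layers [j, i)
                bottleneck = max(best[k - 1][j], prefix[i] - prefix[j])
                if bottleneck < best[k][i]:
                    best[k][i], cut[k][i] = bottleneck, j
    # Walk the recorded cut points backwards to recover the stages.
    bounds, end = [], n
    for k in range(num_stages, 0, -1):
        start = cut[k][end]
        bounds.append((start, end))
        end = start
    return bounds[::-1]
```

With layer costs `[1, 2, 3, 4, 5, 6]` and three devices, the sketch places layers 0-2, 3-4 and 5 on separate devices, giving stage costs of 6, 9 and 6. A naive two-layers-per-device split would give 3, 7 and 11, so the slowest stage would take longer.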

Device Memory Optimization

When computing a backward pass in a neural network, the forward pass activations of the network are required to perform the calculations. Normally this means that with a micro-batch of size N and L layers in the neural network, O(N × L) activations are kept in device memory after the forward pass, in preparation for the backward pass.

GPipe uses a different approach, applying an interesting compute-memory tradeoff – instead of keeping N × L activations in memory, it only keeps the N activations in the final layer of the stage (stage = group of layers). In this case, every time a backward pass begins (from the last layer), the forward pass activations are recomputed and kept in memory. When the backward pass of a single sample is concluded, the activations in memory are discarded and are recomputed for the backward pass of the next sample. Using this approach, the device memory only keeps one set of activations at a time, gaining valuable memory at the price of making O(N) more forward passes. Since the general hardware trend is device speed growing faster than device memory, this tradeoff is often a useful one.
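The tradeoff can be illustrated with a toy network of scalar layers, each computing tanh(w * x). This is a minimal sketch under stated assumptions, not GPipe's implementation: all function names are hypothetical, and the recomputing variant keeps only the activation at each stage boundary (the input a stage receives), rebuilding the activations inside a stage when the backward pass reaches it.

```python
import math


def layer_forward(w, x):
    return math.tanh(w * x)


def layer_backward(w, x, grad_out):
    """Chain rule through y = tanh(w * x) with respect to x."""
    y = math.tanh(w * x)
    return grad_out * w * (1.0 - y * y)


def backward_storing_all(weights, x):
    """Baseline: keep the input of every layer from the forward pass."""
    inputs = []
    for w in weights:
        inputs.append(x)
        x = layer_forward(w, x)
    grad = 1.0
    for w, x_in in zip(reversed(weights), reversed(inputs)):
        grad = layer_backward(w, x_in, grad)
    return grad, len(inputs)  # gradient, values kept after the forward pass


def backward_with_recompute(stages, x):
    """Keep only one boundary activation per stage after the forward pass."""
    boundaries = []
    for stage in stages:
        boundaries.append(x)
        for w in stage:
            x = layer_forward(w, x)
    grad = 1.0
    for stage, x_in in zip(reversed(stages), reversed(boundaries)):
        # Re-run this stage's forward pass to rebuild its inner activations;
        # they are dropped again once the stage's backward pass is done.
        inputs, x = [], x_in
        for w in stage:
            inputs.append(x)
            x = layer_forward(w, x)
        for w, x_layer in zip(reversed(stage), reversed(inputs)):
            grad = layer_backward(w, x_layer, grad)
    return grad, len(boundaries)
```

For six layers grouped into three stages, both functions return the same gradient, but the baseline holds six values between the forward and backward pass while the recomputing variant holds three, at the cost of running each stage's forward pass a second time.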

GPipe Applications

In the paper, the researchers expanded AmoebaNet from 155.3 million parameters to 557 million parameters, and fed it 480×480 ImageNet images as input, as opposed to the downsampled 331×331 images used by the standard AmoebaNet model. The result was an improvement in ImageNet Top-1 Accuracy (84.3% vs 83.5%) and Top-5 Accuracy (97.0% vs 96.5%), marking a new state-of-the-art.

Using the expanded AmoebaNet model for transfer learning, the researchers achieved state-of-the-art (SOTA) results on the CIFAR-10, CIFAR-100, Oxford-IIIT Pets, and Food-101 datasets. The model fell short of state-of-the-art results on two datasets – FGVC Aircraft and Birdsnap – which may be explained by the state-of-the-art models for these datasets leveraging 9.8 million pre-trained images from Google Image Search in addition to the ImageNet data. Note that SOTA results are considered on a clean dataset and a model without transfer learning, meaning the expanded AmoebaNet is very effective for transfer learning but does not represent SOTA results on non-ImageNet tests.

Dataset           Previous SOTA   Expanded AmoebaNet
CIFAR-10          98.5%           99.0%
CIFAR-100         89.3%           91.3%
Oxford-IIIT Pets  94.3%           95.9%
Food-101          90.4%           93.0%
FGVC Aircraft     94.5%           92.7%
Birdsnap          94.5%           83.6%

Implementation details

GPipe is written in TensorFlow and will be open sourced. The technique is not unique to TensorFlow and can be implemented in other platforms as well.

Conclusion

As an open-source library, GPipe will allow machine learning practitioners to train much larger models at relatively low cost. It’s safe to assume that the result will be a new abundance of large machine learning models which will achieve results superior to existing models, as well as increased use of full-size image data as opposed to downscaled images.

While useful for everyone, this breakthrough in pipeline parallelism will naturally provide a special advantage to large organizations that possess massive compute and data.

Special thanks to Yanping Huang, one of the paper’s authors, for valuable insights on the workings of GPipe.

