Convolutional neural networks have proven to be a powerful tool for image recognition, driving ever-improving results in image classification (ImageNet), object detection (COCO), and other tasks. Despite their success, convolutions are limited by their locality, i.e. their inability to consider relations between distant areas of an image. A popular mechanism for overcoming locality is self-attention, which has been shown to capture long-range interactions (e.g. Show, Attend and Tell).
In a recent paper, Attention Augmented Convolutional Networks (AACN), a team from Google Brain presents a new way to add self-attention to common Computer Vision algorithms. By combining convolutional layers and self-attention layers in a ResNet architecture, the researchers were able to achieve top results in image classification and object detection while requiring a smaller model than non-attention ResNet models. The team used the Transformer self-attention architecture first presented in 2017 in the iconic paper Attention Is All You Need, by Vaswani et al. (Vaswani is also part of the AACN team), and adapted it to the Computer Vision use case.
Transformer (Multi-Headed Self-Attention)
The Transformer architecture, which first showed state-of-the-art results on machine translation, consists of an encoder and a decoder constructed in the following way:
Instead of processing tokens one by one, attention modules receive a segment of tokens and learn the dependencies between all of them at once using three learned weight matrices – Query, Key and Value – that together form an Attention Head; in the encoder–decoder attention, the Key and Value come from the encoder side and the Query from the decoder side. The Transformer network consists of multiple layers, each with several Attention Heads (and additional layers), used to learn different relationships between tokens.
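Concretely, each head computes the scaled dot-product attention of Vaswani et al.: queries are matched against keys, the scores are normalized with a softmax, and the result weights the values:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```

where d_k is the key dimension; dividing by the square root of d_k keeps the softmax inputs in a stable range.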
As in many NLP models, the input tokens are first embedded into vectors. Because the attention module processes all tokens concurrently, the model also needs information about their order; this step, named Positional Encoding, helps the network learn the tokens' positions. It is commonly done with a sinusoidal function that generates a vector according to the token’s position, without any learned parameters.
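The sinusoidal encoding can be sketched in a few lines of plain NumPy (`num_positions` and `d_model` are illustrative parameter names, not from the paper):

```python
import numpy as np

def sinusoidal_encoding(num_positions, d_model):
    """Fixed (non-learned) positional encodings from Vaswani et al.:
    even dimensions use sine, odd dimensions use cosine, with wavelengths
    forming a geometric progression."""
    positions = np.arange(num_positions)[:, None]   # (P, 1) token positions
    dims = np.arange(d_model)[None, :]              # (1, D) embedding dims
    # Each pair of dimensions shares one frequency
    angles = positions / np.power(10000, (2 * (dims // 2)) / d_model)
    enc = np.zeros((num_positions, d_model))
    enc[:, 0::2] = np.sin(angles[:, 0::2])          # even dims: sine
    enc[:, 1::2] = np.cos(angles[:, 1::2])          # odd dims: cosine
    return enc
```

The encoding is simply added to the token embeddings before the first attention layer.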
Each head computes this scaled dot-product attention independently, and the Transformer becomes multi-headed attention by running several such heads in parallel. The design has been the basis of well-known models like BERT and GPT-2, which achieved state-of-the-art results in multiple NLP tasks.
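The multi-headed design can be sketched in plain NumPy (a minimal, illustrative version: no masking, biases, or layer normalization; all weight names are assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, Wq, Wk, Wv):
    """One scaled dot-product self-attention head over tokens X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # pairwise token affinities
    return softmax(scores) @ V                # weighted sum of values

def multi_head_attention(X, heads, Wo):
    """Run several heads in parallel, concatenate, and mix with Wo."""
    out = np.concatenate([attention_head(X, *h) for h in heads], axis=-1)
    return out @ Wo
```

Each head learns its own (Wq, Wk, Wv) triple, so different heads can attend to different relationships between the same tokens.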
AACN isn’t the first paper to consider overcoming the problem of convolutional locality by adding an additional structure to a CNN model. A successful approach to this problem was presented by Hu et al. in a 2017 paper named Squeeze-and-Excitation (SE) Networks. The SE model consists of regular convolutional blocks whose results are combined with special SE blocks, each consisting of a Squeeze function and an Excitation function:
- The Squeeze function collects global information by gathering summary data on each channel (i.e. the output of a single convolution filter) coming out of a convolutional layer. The summary is performed through global average pooling (i.e. averaging all the values in every channel), creating a vector z whose C entries summarize the C channels.
- The Excitation function aims to capture nonlinear dependencies between different channels by applying a ReLU and then a sigmoid, with accompanying learned parameters (W1, W2), to the channel summary z.
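The two functions can be sketched together in plain NumPy (a simplified version, assuming an (H, W, C) feature map; `W1` and `W2` follow the SE paper's notation, and the block ends by rescaling the input channels):

```python
import numpy as np

def se_block(feature_map, W1, W2):
    """Squeeze-and-Excitation over an (H, W, C) feature map.
    W1: (C, C // r), W2: (C // r, C), with r the reduction ratio."""
    # Squeeze: global average pooling summarizes each channel into z
    z = feature_map.mean(axis=(0, 1))                        # shape (C,)
    # Excitation: ReLU then sigmoid capture channel dependencies
    s = 1.0 / (1.0 + np.exp(-(np.maximum(z @ W1, 0) @ W2)))  # shape (C,)
    # Rescale each channel of the original map by its learned weight
    return feature_map * s
```

The per-channel weights s act as a learned, input-dependent gate on the convolution's output channels.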
A Squeeze-and-Excitation-based ResNet won the 2017 ImageNet contest, improving on the 2016 winner’s top-5 error by 25%. Importantly, the result was achieved with little additional computation cost, and SE blocks have since proven useful in improving ResNet performance in various Computer Vision tasks.
Augmenting CNNs with attention
While Squeeze-and-Excitation blocks operate on the result of the convolutional operation, AACN applies multi-headed attention in parallel to the convolution operation and then concatenates the attention output with the convolution output.
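A rough NumPy sketch of the augmentation idea (not the paper's TensorFlow implementation: `conv_fn` stands in for any convolution, positional information is omitted, and a single attention head is shown for brevity):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def augmented_conv(feature_map, conv_fn, Wq, Wk, Wv):
    """Attention augmentation: self-attend over all spatial positions
    in parallel with a convolution, then concatenate the two outputs
    along the channel axis."""
    H, W, C = feature_map.shape
    X = feature_map.reshape(H * W, C)          # flatten positions into "tokens"
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V
    attn = attn.reshape(H, W, -1)              # back to a spatial map
    return np.concatenate([conv_fn(feature_map), attn], axis=-1)
```

Because part of the output channels come from attention, the convolution can use fewer filters for the same total width, which is what keeps the augmented model's parameter count down.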
When using the Transformer architecture for visual attention, the positional encoding part of the architecture must be updated since there isn’t a clear sequence of words like there is in text. In AACN, the team added matrices representing the relative height and width of the attended positions.
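An illustrative (and deliberately naive, loop-based) sketch of such relative position scores, assuming learned embedding tables `rel_h` and `rel_w` indexed by relative offset (names and shapes are assumptions for illustration, not the paper's implementation):

```python
import numpy as np

def relative_logits(Q, rel_w, rel_h, H, W):
    """Extra attention logits that depend only on the relative height
    and width offset between query and key positions.
    Q: (H*W, d) queries; rel_h: (2*H-1, d); rel_w: (2*W-1, d)."""
    d = Q.shape[-1]
    Qm = Q.reshape(H, W, d)
    logits = np.zeros((H, W, H, W))
    for i in range(H):
        for j in range(W):          # query position (i, j)
            for k in range(H):
                for l in range(W):  # key position (k, l)
                    # offsets are shifted by H-1 / W-1 to index the tables
                    logits[i, j, k, l] = (Qm[i, j] @ rel_h[k - i + H - 1]
                                          + Qm[i, j] @ rel_w[l - j + W - 1])
    return logits.reshape(H * W, H * W)
```

These logits are added to the content-based QK scores before the softmax, so attention can distinguish where a key lies relative to the query even though all positions are processed at once.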
The AACN architecture achieves top results in CIFAR-100 and ImageNet image classification, and in COCO object detection. The study also found that the multi-headed self-attention layers can replace the convolutional layers entirely, albeit with lower accuracy than a combination of convolution and self-attention.
Model Size & Compute
Adding multi-headed self-attention layers to a ResNet doesn’t meaningfully increase the number of parameters because in AACN networks the convolution layers use fewer filters than in vanilla ResNet architectures. Generally speaking, AACNs usually require fewer parameters than equivalent ResNet models because the attention layer requires slightly fewer parameters than 3×3 convolutions and slightly more parameters than 1×1 convolutions (the paper includes a further inquiry into memory usage). When compared with a different well-known architecture, MnasNet, AACN requires slightly more parameters and more compute time, but the margins are rather minimal (<10% parameter difference, <15% compute-time difference).
The authors implemented the AA block in TensorFlow and shared the (relatively compact) code in the paper itself.
AACN presents a new way to enhance convolutional neural networks without compromising on model size and compute. The paper shows yet another useful application of the Attention Is All You Need paper, and although multi-headed self-attention isn’t in fact “all you need” in every case, it appears that it is a powerful tool for many machine learning challenges.