BagNet – Solving ImageNet with a Simple Bag-of-features Model

Prior to 2012, most machine learning algorithms were statistical models that used hand-crafted features. These models were highly explainable and somewhat effective, but failed to reach high accuracy in many language and computer vision tasks. In 2012, AlexNet, a deep neural network model, won the ImageNet competition by a large margin and ignited the deep learning revolution of the past 6 years.

Deep learning models have proven to be significantly more accurate than standard ML algorithms, presumably because of their ability to ‘intuitively’ understand a concept without receiving hand-crafted features that characterize it. Unfortunately, because of this ‘intuitive’ understanding, deep learning models suffer from an explainability problem. It’s difficult to understand how a deep learning algorithm reached its conclusion and, accordingly, why it made a mistake when it did.

BagNet, a new paper from the University of Tübingen (Germany), sheds new light on the tradeoff between accuracy and explainability in machine learning. It presents a model which achieves state-of-the-art results on ImageNet for non-deep learning models, comparable to results achieved by VGG-16 and surpassing AlexNet. The result could provide new insights into the capabilities of non-deep learning algorithms, and set a higher standard for both ML algorithms and the challenges used to evaluate them.


One of the most popular and well-known concepts in classical machine learning is bag-of-words. When analyzing a text document in the training data, a bag-of-words algorithm counts how often each word appears in the document, while ignoring “stop words” (‘the’, ‘and’, …), and uses the counts as features for that document. Word order is discarded. Similarly, in computer vision, a bag-of-features model creates a set of visual features from the training data (e.g. curves, lines, colors) and then uses the features to analyze the test data, without regard to where in the image they occur.
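As a rough illustration of the bag-of-words idea, here is a minimal Python sketch. The stop-word list, the vocabulary, and the function names are all made up for this example; real systems use much larger lists and vocabularies.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "and", "a", "of", "on"}


def bag_of_words(text):
    """Count how often each non-stop word appears; word order is discarded."""
    words = text.lower().split()
    return Counter(w for w in words if w not in STOP_WORDS)


def to_feature_vector(counts, vocabulary):
    """Turn word counts into a fixed-length feature vector over a vocabulary."""
    # Counter returns 0 for words that never appeared in the document.
    return [counts[word] for word in vocabulary]


doc = "the cat sat on the mat and the cat slept"
counts = bag_of_words(doc)
vocabulary = ["cat", "mat", "sat", "slept", "dog"]  # illustrative vocabulary
features = to_feature_vector(counts, vocabulary)
print(features)
```

Two documents containing the same words in a different order produce the same feature vector, which is the property BagNet carries over to image patches.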

How BagNet Works

BagNet uses a visual bag-of-local-features model to perform ImageNet classification. The training process is performed in the following way:

  1. First, training images are divided into local q x q sub-images (patches).
  2. Each patch is mapped to a 1000-dimensional vector representing how strongly the patch predicts each ImageNet class. To get from a q x q x 3 patch to that vector, the pixels are first put through 48 ResNet blocks, each performing 1×1 or 3×3 convolutions, which yields a 2048-dimensional feature vector.
  3. A linear classifier (a fully connected layer followed by a softmax) is applied to each 2048-dimensional vector, converting it into a 1000-dimensional vector. Each value in the vector represents the probability of one class for the given patch. Arranged by patch position, these values form a ‘heatmap’ in BagNet terms.
  4. The per-patch vectors are averaged over all patches, and another softmax layer is applied to produce the probability of each class for the entire image.
  5. To train the network, the results from the final softmax layer are compared with the actual class, performing backpropagation to set the ResNet network’s weights.
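The forward pass in steps 1 to 4 can be sketched in plain Python. This is a simplified illustration, not the authors' implementation. It assumes a single-channel image stored as a list of rows, a stride of 1 between patches, and a plain linear map standing in for the ResNet blocks and classifier. It also averages raw per-patch scores and applies one softmax at the end. All function names are hypothetical.

```python
import math


def extract_patches(image, q):
    """Step 1: slide a q x q window over the image (stride 1 is an assumption)."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(height - q + 1):
        for left in range(width - q + 1):
            patches.append([row[left:left + q] for row in image[top:top + q]])
    return patches


def patch_class_scores(patch, weights):
    """Steps 2-3, simplified: a linear map replaces the ResNet blocks + classifier."""
    flat = [pixel for row in patch for pixel in row]
    return [sum(w * x for w, x in zip(class_weights, flat))
            for class_weights in weights]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_image(image, q, weights):
    """Step 4: average the per-patch scores, then apply a softmax."""
    per_patch = [patch_class_scores(p, weights) for p in extract_patches(image, q)]
    num_classes = len(weights)
    averaged = [sum(scores[c] for scores in per_patch) / len(per_patch)
                for c in range(num_classes)]
    return softmax(averaged)


# Toy example: a 3 x 3 image, 2 x 2 patches, 2 classes (all values made up).
image = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
weights = [[1, 1, 1, 1], [0, 0, 0, 0]]  # one weight row per class
probabilities = classify_image(image, q=2, weights=weights)
print(probabilities)
```

Because each patch is scored independently and the scores are only averaged, the model never sees how patches are arranged relative to one another.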

To classify images in the test set, the researchers divide the test images into patches, average the per-patch results, and predict the class that received the highest score. Because each class score is a linear combination (a simple average) of patch scores, it is straightforward to trace why the algorithm assigned an image to a given class: every patch contributes a known, separable amount to the final score.
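The following Python sketch shows why an average makes attribution easy: each patch's share of the winning class score is simply its own score divided by the number of patches. The per-patch scores and function names here are invented for illustration.

```python
def predict(per_patch_scores):
    """Average the per-patch class scores and return the best class index."""
    num_patches = len(per_patch_scores)
    num_classes = len(per_patch_scores[0])
    averaged = [sum(scores[c] for scores in per_patch_scores) / num_patches
                for c in range(num_classes)]
    best = max(range(num_classes), key=lambda c: averaged[c])
    return best, averaged


def patch_contributions(per_patch_scores, class_index):
    """Each patch's additive share of the averaged score for one class."""
    num_patches = len(per_patch_scores)
    return [scores[class_index] / num_patches for scores in per_patch_scores]


# Three patches, two classes; the numbers are made up for this example.
per_patch_scores = [[2.0, 0.5], [0.0, 1.0], [4.0, 0.0]]
predicted, averaged = predict(per_patch_scores)
contributions = patch_contributions(per_patch_scores, predicted)
most_influential = max(range(len(contributions)), key=lambda i: contributions[i])
print(predicted, contributions, most_influential)
```

The contributions sum exactly to the averaged class score, so ranking them identifies the patches that drove the decision.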

Heat maps showing which pixels are indicative of the ImageNet class of the image in the top row, for the BagNet-9 (9×9 pixels), BagNet-17 (17×17 pixels), and BagNet-33 (33×33 pixels) versions. Darker colors represent the parts of each training image which are indicative of the class. (Image: BagNet)


The researchers tested BagNet in three patch size configurations: 9, 17, and 33 pixels per side. They found that the 33-pixel configuration is the most accurate, with a Top-5 validation accuracy of 87.6% on ImageNet, a result approaching that of VGG-16. The 17-pixel configuration also achieved an impressive score, with an 80.5% Top-5 result, similar to that of AlexNet.
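For readers unfamiliar with the Top-5 metric: a prediction counts as correct if the true class is among the five highest-scoring classes. A minimal Python sketch follows, using made-up scores and k=2 so the toy example stays small. The function name is hypothetical.

```python
def top_k_accuracy(score_lists, true_labels, k=5):
    """Fraction of images whose true label is among the k highest-scoring classes."""
    hits = 0
    for scores, label in zip(score_lists, true_labels):
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(true_labels)


# Three images, three classes; scores and labels are invented for this example.
scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
labels = [2, 0, 0]
top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
print(top1, top2)
```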

Interestingly, thanks to the simple patch averaging, the researchers could easily show the cause of each of the algorithm's mistakes, as can be seen in the following examples:

In the top image, the green in the background makes the algorithm predict a green Granny Smith apple. In the middle image, the close-up of the thimble could look like a gas mask because it covers the eyes. In the bottom image, the miniskirt may look like a book jacket, since book jackets usually include large amounts of text. (Image: BagNet)

The researchers then attempted to test if common deep learning algorithms also rely on specific image patches or can gain a broader understanding of an image, where a broad understanding would mean connecting different areas of the image to a global ‘understanding’ of spatial relationships. To do so, they masked the most indicative patches of the image according to the BagNet representation, and then tested how effective the deep learning algorithms are when receiving the masked images as input.
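The masking step can be sketched as follows. This Python example assumes a single-channel image, a heatmap with one evidence value per patch position (indexed by the patch's top-left corner), and a constant fill value. None of these details are taken from the paper, and the function name is hypothetical.

```python
def mask_top_patches(image, heatmap, q, num_to_mask, fill=0):
    """Blank out the q x q patches with the highest class evidence.

    heatmap[r][c] is the evidence of the patch whose top-left corner is (r, c).
    Returns a masked copy; the input image is left unchanged.
    """
    positions = [(heatmap[r][c], r, c)
                 for r in range(len(heatmap))
                 for c in range(len(heatmap[0]))]
    positions.sort(reverse=True)  # strongest evidence first

    masked = [row[:] for row in image]
    for _, top, left in positions[:num_to_mask]:
        for r in range(top, top + q):
            for c in range(left, left + q):
                masked[r][c] = fill
    return masked


# Toy 4 x 4 image with 2 x 2 patches gives a 3 x 3 heatmap (values made up).
image = [[1] * 4 for _ in range(4)]
heatmap = [[5, 1, 0], [1, 0, 0], [0, 0, 2]]
masked = mask_top_patches(image, heatmap, q=2, num_to_mask=1)
for row in masked:
    print(row)
```

The masked images would then be fed to the network under test, and the drop in its confidence compared with the unmasked input.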

They found that while relatively shallow neural networks like VGG-16 suffer significantly from the masking, the masking had only a minor effect on deeper and more modern neural networks like ResNet-152 and DenseNet-169. The result hints that, as expected, the deeper layers of a neural network help it capture large-scale spatial relationships.

Implementation Details & Compute

In the suggested design, BagNet models had roughly 73% lower inference throughput than vanilla ResNet-50 models, processing 155 images/sec versus 570 images/sec on the same hardware. According to the researchers, the difference may be attributed to the reduced amount of downsampling in BagNet.

The model specification and pretrained weights can be found here.


While bag-of-features models aren’t likely to make a comeback any time soon, the BagNet results show that it is possible to create a high-quality computer vision baseline without deep neural networks. In the future, such models may prove useful in cases where explainability is key (for instance, medical applications or self-driving cars), or as a tool for debugging deep neural networks.

An additional takeaway, also expressed by one of the researchers, is that the research community needs better tasks than ImageNet to test algorithms’ ability to understand images in a non-local fashion.
