Pruning is a well known Machine Learning technique in which unnecessary weights are removed from a neural network model after training. In some cases, pruning can reduce model sizes by more than 90% without compromising on model accuracy while potentially offering a significant reduction in inference memory usage (see some great examples here).
In 2018, MIT researchers Jonathan Frankle and Michael Carbin published a groundbreaking paper – The Lottery Ticket Hypothesis – which showed a new implication of neural network pruning. The researchers were curious to see if the neurons that remain in a pruned neural network could be enough to train the network in the first place, implying that the pruned weights were unnecessary for the training.
They found that, at least in some cases, it is indeed possible to start with a smaller network (“a winning lottery ticket”) and train to a similar accuracy as its larger, unpruned, counterpart. While it’s not clear how to discover “winning tickets” without already training the full network and then pruning, the idea that such configurations exist is intriguing and could lead to future findings on how to better initialize neural network weights, leading to more effective and efficient results.
A recent paper, Stabilizing the Lottery Ticket Hypothesis (Frankle, Dziugaite, Roy, Carbin), looks deeper into the Lottery Ticket Hypothesis, presenting a technique to find “winning tickets” in deep networks in a more methodical way using a ‘stability’ metric.
The Lottery Ticket Hypothesis
The Lottery Ticket finding brought up the question – how do you structure a small neural network and how do you initialize its weights so that after training it’s as accurate as a larger network with the same architecture?
In the paper’s terminology, training a small network accurately is “winning the lottery”, and the correct weight initialization is the “winning ticket”. The study found that while almost all initial configurations lead to reduced model accuracy (“losing tickets”), there exist rare configurations which do lead to successful training. One way of finding such “winning tickets” is to take all the neurons that survived the pruning of a network and observe their initial weights (i.e. their weights from the beginning of the training process). When you keep only the unpruned connections and initialize the neurons with their pre-training weights, it’s possible to train a smaller network (10%-20% of the original network size in some cases) to the same accuracy as the larger network, at times even improving on the original accuracy.
The result was replicated on fully connected networks (LeNet) for MNIST and with convolutional neural networks (small VGG models) for CIFAR-10. Unfortunately, the same technique proved to be less effective with deeper neural networks (ResNet-18 and VGG-19), sometimes requiring hyperparameter tweaking and sometimes failing entirely. Despite the difficulties with deeper networks, Frankle and Carbin suggested the Lottery Ticket Hypothesis – every pruned neural network has some weight initialization (a “winning ticket”) which would allow it to train to an accuracy equivalent to that of its non-pruned network (“win the lottery”).
In the 2018 Frankle and Carbin paper, winning tickets were only discovered by looking at the weights of the neuron at initialization. In the 2019 Frankle et al paper, the researchers presented a new technique called Iterative Magnitude Pruning (IMP) with Rewinding. The technique is similar to the Frankle and Carbin idea with one difference – instead of looking at the weights of neurons at initialization (“rewinding to iteration zero”), it instead looks at their weights after several training iterations (described in the paper as “rewinding to iteration k”). The rationale behind the technique is that in many architectures, some neurons are already at a “winning” weight at the beginning while others only reach a “winning” weight after some training. The paper represents the algorithm in the following formula, with “vanilla IMP” as IMP with k=0, while IMP with Rewinding is with T >> k > 0.
Interestingly, Frankle et al find that for several deep architectures – Resnet-50, Squeezenet, and Inception-v3 – it’s possible to find winning tickets at early points in training (0.1%-7%) that are 50% to 99% smaller while achieving an accuracy equivalent to the non-pruned network.
The result prompted an update to the Lottery Ticket Hypothesis, which is less powerful but perhaps more accurate – The Lottery Ticket Hypothesis with Rewinding. This hypothesis says that in neural networks it’s possible to find a “winning ticket” at a k which is materially smaller than the total number of iterations required to train the network (k << T* in the following formula). A formal description of the hypothesis:
In order to better characterize winning tickets, Frankle et al present a term of ‘stability’, which represents how much the weights of neurons are impacted by the masking of neighboring neurons. If a neuron reaches the point where it’s stable, i.e. not impacted by the masking of neighboring neurons, it becomes a good candidate for a winning ticket. While it’s still changing rapidly during training, i.e. “unstable”, it’s unlikely to be part of a “winning ticket”. According to Frankle et al, it makes the most sense to search for winning tickets at an iteration where the unpruned weights have already stabilized.
The researchers look at two forms of stability – ‘stability to pruning’ and ‘stability to noise’, where noise takes into consideration the inherent randomness of stochastic gradient descent (SGD). A formal description of stability:
The researchers used the IMP with Rewinding technique with several popular architectures, tackling the ubiquitous ImageNet challenge. The results are impressive:
|k as % of total iterations||4.4%||3.5%||6.6%|
|Size of pruned network achieving equivalent accuracy||30% of unpruned network||30% of unpruned network||50% of unpruned network|
Interestingly, using the same technique at slightly lower levels of pruning the three network architectures actually results in networks outperforming their unpruned accuracy.
The researchers attempted to replicate the results with k=0, but were unable to achieve accuracy comparable to the original network, putting in question the original Lottery Ticket Hypothesis (though not disproving it).
The Lottery Ticket Hypothesis presents an interesting view into the mechanics of neural network training and a possible view into ways to more efficiently train networks. The new concept of stability to pruning and noise offers additional insight into the creation of “winning tickets”, potentially improving our understanding of “winning tickets” and how they’re created.