[1910.07454v2] An Exponential Learning Rate Schedule for Deep Learning
The paper shows rigorously how BN allows a host of very exotic learning rate schedules in deep learning, and verifies these effects in experiments

Intriguing empirical evidence exists that deep learning can work well with exotic schedules for varying the learning rate. This paper suggests that the phenomenon may be due to Batch Normalization, or BN (Ioffe & Szegedy, 2015), which is ubiquitous and provides benefits in optimization and generalization across all standard architectures. The following new results are shown about BN with weight decay and momentum (in other words, the typical use case, which was not considered in earlier theoretical analyses of stand-alone BN (Ioffe & Szegedy, 2015; Santurkar et al., 2018; Arora et al., 2018)):

• Training can be done using SGD with momentum and an exponentially increasing learning rate schedule, i.e., the learning rate increases by some (1 + α) factor in every epoch for some α > 0. (Precise statement in the paper.) To the best of our knowledge this is the first time such a rate schedule has been successfully used, let alone for highly successful architectures. As expected, such training rapidly blows up the network weights, but the net stays well-behaved due to normalization.

• A mathematical explanation of the success of the above rate schedule: a rigorous proof that it is equivalent to the standard setting of BN + SGD + Standard Rate Tuning + Weight Decay + Momentum. This equivalence holds for other normalization layers as well: Group Normalization (Wu & He, 2018), Layer Normalization (Ba et al., 2016), Instance Norm (Ulyanov et al., 2016), etc.

• A worked-out toy example illustrating the above linkage of hyperparameters. Using either weight decay or BN alone reaches a global minimum, but convergence fails when both are used.
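The equivalence in the second bullet can be illustrated in a stripped-down setting. The sketch below is a momentum-free special case, with a Rayleigh quotient standing in for the scale-invariant loss of a normalized network (all names and constants here are illustrative, not the paper's): for a 0-homogeneous loss, SGD with weight decay λ and fixed learning rate η traces the same functions, step by step, as weight-decay-free SGD whose learning rate grows by a factor (1 − λη)^(−2) every step.

```python
import numpy as np

# Stand-in scale-invariant loss: the Rayleigh quotient L(w) = w^T A w / w^T w.
# Like the loss of a network with normalization layers, it satisfies
# L(c*w) = L(w), and hence grad L(c*w) = grad L(w) / c.
rng = np.random.default_rng(0)
B = rng.normal(size=(5, 5))
A = B + B.T  # symmetric

def loss(w):
    return (w @ A @ w) / (w @ w)

def grad(w):
    return 2.0 * (A @ w - loss(w) * w) / (w @ w)

eta, lam, T = 0.1, 0.01, 50   # fixed LR, weight decay, number of steps
rho = 1.0 - lam * eta
w0 = rng.normal(size=5)

# Run 1: SGD + weight decay at a fixed learning rate.
w = w0.copy()
for t in range(T):
    w = rho * w - eta * grad(w)

# Run 2: no weight decay; the LR grows exponentially, by rho**(-2) per step.
v = w0.copy()
for t in range(T):
    v = v - eta * rho ** (-(2 * t + 1)) * grad(v)

# The trajectories coincide up to an exponential rescaling of the weights,
# so at every step the two runs represent the same function.
assert np.allclose(v, w / rho ** T)
assert np.isclose(loss(v), loss(w))
```

Momentum makes the bookkeeping heavier (the paper's theorem covers that case), but the mechanism is the same: normalization makes the loss scale-invariant, so growth in the weight norm can be traded for growth in the learning rate.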

Figure 1: Taking PreResNet32 with standard hyperparameters and replacing WD during the first phase (fixed LR) by an exponential LR following Theorem ??, with schedule η̃_t = 0.1 × 1.481^t and momentum 0.9. The plot on the right shows that the weight norm of the first convolutional layer in the second residual block grows exponentially, satisfying ‖w_t‖² / η̃_t = constant. The reason is that, according to the proof, this quantity is essentially the squared weight norm when training with fixed LR + WD + momentum, and the published hyperparameters kept that norm roughly constant during training. (Replacing WD by Exponential LR: Case of constant LR with momentum)

Figure 2: PreResNet32 trained with standard Step Decay and its corresponding tapered-exponential LR schedule. As predicted by Theorem ??, they have similar trajectories and performances. (Replacing WD by Exponential LR: Case of multiple LR phases)

Figure 3: Instant LR decay is crucial when the LR growth η̃_t/η̃_{t−1} − 1 is very small. The original LR of Step Decay is divided by 10 at epochs 80 and 120 respectively. In the third phase, the LR growth η̃_t/η̃_{t−1} − 1 is approximately 100 times smaller than in the first phase, so it would take TEXP-- hundreds of epochs to reach its equilibrium. As a result, TEXP achieves better test accuracy than TEXP--. By comparison, in the second phase η̃_t/η̃_{t−1} − 1 is only 10 times smaller than in the first phase, and it takes only 70 epochs to return to equilibrium. (The benefit of instant LR decay)

Figure 4: Instant LR decay has only a temporary effect when the LR growth η̃_t/η̃_{t−1} − 1 is large. The blue line uses an exponential LR schedule with a constant exponent. The orange line multiplies its LR by the same constant each iteration, but also divides the LR by 10 at the start of epochs 80 and 120. The instant LR decay only allows the parameters to stay at a good local minimum for one epoch before diverging, behaving similarly to the trajectory without instant LR decay. (The benefit of instant LR decay)

Figure 5: The orange line corresponds to PreResNet32 trained with constant LR and WD divided by 10 at epochs 80 and 120. The blue line is TEXP-- corresponding to the Step Decay schedule which divides the LR by 10 at epochs 80 and 120. They have similar trajectories and performances by an argument similar to Theorem ??. (See Theorem ?? and its proof in Appendix ??) (The benefit of instant LR decay)

Figure 6: Both the Cosine and Step Decay schedules behave almost the same as their exponential counterparts, as predicted by our equivalence theorem. The (exponential) Cosine LR schedule achieves better test accuracy, with an entirely different trajectory. (Better Exponential LR Schedule with Cosine LR)

(a) Input (I); (b) Linear (L); (c) Addition (+); (d) Normalization (N); (e) Bias (B); (f) alternative definition of Bias (B); (g) Normalization with Affine (NA); (h) definition of Normalization with Affine (NA); (i) degree of homogeneity of the output of basic modules given the degree of homogeneity of the input. (Notations)

Figure 8: Degree of homogeneity for all modules in vanilla CNNs/FC networks. (Networks without Affine Transformation and Bias)

Figure 9: An example of the full network structure of ResNet/PreResNet represented by the composite modules defined in Figures ??, ??, ??, ??, where 'S' denotes the starting part of the network, 'Block' denotes a normal block with a residual link, 'D-Block' denotes a block with downsampling, and 'N' denotes the normalization layer defined previously. The integer x ∈ {0, 1, 2} depends on the type of network. See details in Figures ??, ??, ??, ??. (Networks without Affine Transformation and Bias)

(a) The starting part of ResNet; (b) a block of ResNet; (c) a block of ResNet with downsampling; (d) degree of homogeneity for all modules in ResNet without affine transformation in the normalization layer. The last normalization layer is omitted. (Networks without Affine Transformation and Bias)

(a) The starting part of PreResNet; (b) a block of PreResNet; (c) a block of PreResNet with downsampling; (d) degree of homogeneity for all modules in PreResNet without affine transformation in the normalization layer. The last normalization layer is omitted. (Networks without Affine Transformation and Bias)

Figure 12: Degree of homogeneity for all modules in vanilla CNNs/FC networks. (Networks with Affine Transformation)

(a) The starting part of ResNet; (b) a block of ResNet; (c) a block of ResNet with downsampling; (d) degree of homogeneity for all modules in ResNet with trainable affine transformation. The last normalization layer is omitted. (Networks with Affine Transformation)

(a) The starting part of PreResNet; (b) a block of PreResNet; (c) a block of PreResNet with downsampling; (d) degree of homogeneity for all modules in PreResNet with trainable affine transformation. The last normalization layer is omitted. (Networks with Affine Transformation)

Figure 15: The network may not be scale-invariant if GN or IN is used and the bias of the linear layer is trainable. The red 'F' marks where Algorithm ?? will return False. (Networks with Affine Transformation)
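The homogeneity bookkeeping above can be checked numerically in a toy setting. A minimal numpy sketch (shapes and names are illustrative, not the paper's): the output of a linear layer is 1-homogeneous in its weight, and a BN-style normalization placed after it (no affine part, no bias) brings the degree down to 0, i.e., makes the output invariant to the scale of that weight.

```python
import numpy as np

def normalize(z):
    # BN-style normalization over the batch axis, without the affine part.
    return (z - z.mean(axis=0)) / z.std(axis=0)

rng = np.random.default_rng(1)
x = rng.normal(size=(16, 8))   # a batch of 16 inputs (illustrative shapes)
W = rng.normal(size=(8, 4))    # weight of a linear layer

lin = lambda W: x @ W
# The linear output is 1-homogeneous in W ...
assert np.allclose(lin(3.0 * W), 3.0 * lin(W))
# ... and normalization makes it 0-homogeneous (scale-invariant) in W.
assert np.allclose(normalize(lin(3.0 * W)), normalize(lin(W)))
```

Trainable affine parameters and biases change these degrees module by module, which is exactly what the tables in the figures above track.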