[1912.00144] Learning Rate Dropout

Abstract The performance of a deep neural network depends heavily on how it is trained, and finding better local optima is the goal of many optimization algorithms. However, existing optimizers tend to follow descent paths that converge slowly and make no attempt to avoid bad local optima. In this work, we propose Learning Rate Dropout (LRD), a simple gradient-descent technique, related to coordinate descent, for training deep networks. LRD helps the optimizer actively explore the parameter space by randomly setting some learning rates to zero; at each iteration, only the parameters whose learning rate is nonzero are updated. Because a different subset of learning rates is dropped at each iteration, the optimizer samples a new loss-descent path for each update. This uncertainty in the descent path helps the model avoid saddle points and bad local minima. Experiments show that LRD is surprisingly effective at accelerating training while preventing overfitting.
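The core idea in the abstract can be sketched in a few lines: draw a Bernoulli mask over the parameters at each step and zero the learning rate of the masked-out coordinates, so only the surviving coordinates move. The sketch below is a minimal illustration on plain SGD, not the paper's exact implementation; the function and parameter names (`lrd_sgd_step`, `keep_prob`) are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def lrd_sgd_step(params, grads, lr=0.1, keep_prob=0.5):
    """One SGD step with Learning Rate Dropout (illustrative sketch).

    A Bernoulli(keep_prob) mask zeroes the learning rate of some
    coordinates, so only parameters whose learning rate survives are
    updated in this iteration. Forward and backward passes are
    unaffected; only the update itself is masked.
    """
    mask = rng.binomial(1, keep_prob, size=params.shape)  # 1 = keep this learning rate
    return params - lr * mask * grads

# Toy usage: minimize f(x) = ||x||^2, whose gradient is 2x.
x = np.ones(4)
for _ in range(200):
    x = lrd_sgd_step(x, 2 * x, lr=0.1, keep_prob=0.5)
```

With 3 parameters, the mask can take 2³ = 8 values, which is exactly the "8 loss descent paths" picture in Figure 3; despite the random masking, the iterate above still converges toward the minimum.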
Figure 1. (a) Gradient updates trapped at a saddle point. (b) With learning rate dropout, the optimizer escapes saddle points more quickly. Red arrow: the update of each parameter in the current iteration. Red dot: the initial state. Yellow dots: subsequent states. ×: a randomly dropped learning rate.
Figure 2. (a) Backpropagation in a neural network; black lines show the gradient updates to each weight parameter (e.g. wqs, wsv, wvz). (b) Standard dropout: dropped units appear in neither forward nor backward propagation during training. (c) A red line marks a dropped learning rate, so the corresponding weight is not updated. The dropped learning rate affects neither forward propagation nor gradient backpropagation; different learning rates are dropped at each iteration.
Figure 3. Applying learning rate dropout during training. The model has 3 parameters, so there are 8 candidate loss-descent paths at each iteration. Blue dot: the model state. Solid lines: per-parameter updates. Dashed line: the resulting model update. ×: a dropped learning rate.
Figure 4. Visualization of loss-descent paths. Learning rate dropout helps Adam escape from a local minimum. The marked point (−0.74, 1.40) is the optimum.
Figure 5. Learning curves for a 2-layer FCNet on MNIST. Top: training loss. Middle: training accuracy. Bottom: test accuracy.
Figure 6. Learning curves for ResNet-34 on CIFAR-10. Top: training loss. Middle: training accuracy. Bottom: test accuracy.
Figure 7. Learning curves for DenseNet-121 on CIFAR-100. Top: training loss. Middle: training accuracy. Bottom: test accuracy.
Figure 8. Results for PSPNet on the VOC2012 semantic segmentation dataset. Left: mean IoU. Right: pixel accuracy.
Figure 9. Results for object detection. Left: training loss. Right: validation loss.
Figure 10. Results with different dropout rates p. Top: Adam. Bottom: SGDM.
Figure 11. Results on CIFAR-10 with different regularization strategies. Top: Adam. Bottom: SGDM.
Figure 12. LRD vs. DG on CIFAR-10 (using ResNet-34). Top: Adam. Bottom: SGDM.
Figure 13. Results with different standard-dropout rates psd.