[1910.08525v1] Scheduling the Learning Rate via Hypergradients: New Insights and a New Algorithm
We note, however, that MARTHE is a general technique for finding online hyperparameter schedules (although it scales linearly with the number of hyperparameters), and may thus offer a competitive alternative in other application scenarios, such as tuning regularization parameters (Luketina et al., 2016).

We study the problem of fitting task-specific learning rate schedules from the perspective of hyperparameter optimization. This allows us to explicitly search for schedules that achieve good generalization. We describe the structure of the gradient of a validation error w.r.t. the learning rate, the hypergradient, and based on this we introduce a novel online algorithm. Our method adaptively interpolates between the recently proposed techniques of Franceschi et al. (2017) and Baydin et al. (2018), featuring increased stability and faster convergence. We show empirically that the proposed method compares favourably with baselines and related methods in terms of final test accuracy.
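To make the setting concrete, the following is a minimal sketch of the hypergradient-descent baseline of Baydin et al. (2018) that the proposed method interpolates with: the scalar learning rate is itself updated online by gradient descent, using the dot product of consecutive gradients as an estimate of the derivative of the loss with respect to the learning rate. This is an illustration of the baseline technique on a toy quadratic, not the paper's MARTHE algorithm; the function names and the toy objective are ours.

```python
import numpy as np

def hd_sgd(grad, w0, eta0=0.05, beta=1e-3, steps=200):
    """SGD with hypergradient-descent learning-rate adaptation
    (in the style of Baydin et al., 2018).

    At each step the learning rate eta is nudged by
    beta * <g_t, g_{t-1}>, i.e. a gradient step on the loss
    with respect to eta itself.
    """
    w = np.asarray(w0, dtype=float)
    eta = eta0
    g_prev = np.zeros_like(w)
    etas = []
    for _ in range(steps):
        g = grad(w)
        # hypergradient of f(w_t) w.r.t. eta is -<g_t, g_{t-1}>,
        # so descending it adds beta * <g_t, g_{t-1}> to eta
        eta = eta + beta * float(g @ g_prev)
        w = w - eta * g
        g_prev = g
        etas.append(eta)
    return w, etas

# toy objective f(w) = 0.5 * ||w||^2, whose gradient is w
w_final, etas = hd_sgd(lambda w: w, w0=[2.0, -1.0])
```

On this toy problem consecutive gradients stay aligned, so the learning rate grows from its initial value while the iterate converges to the minimizer; with a poor initial eta the same mechanism shrinks it, which is the "online schedule" behaviour the paper studies.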
Figure 1: Loss surface and trajectories for 500 steps of gradient descent with HD and RTHO on the Beale function (left) and a smoothed and simplified Bukin N. 6 function (right). Center: best objective value reached within 500 iterations for various values of β that do not lead to divergence. (Structure of the Hypergradient)

Figure 2: Left: schedules found by LRS-OPT (after 5000 iterations of SGD) on 4 different random seeds. Center: qualitative comparison between offline and online schedules for one random seed. For MARTHE with fixed µ, we report the best-performing one. For each method, we report the schedule generated with the value of β that achieves the best average final validation accuracy. Plots for the remaining random seeds can be found in the appendix. Right: average validation accuracy over 20 random seeds, for various values of β. For reference, the average validation accuracy of the network trained with η = 0.01·1512 is 87.5%, while LRS-OPT attains 96.2%. (Optimized and Online Schedules)

Figure 3: Results for VGG-11 on CIFAR-10 with SGDM as the inner optimizer: (Left) accuracy, (Center) loss of the objective function on the validation set, and (Right) the learning rate schedule generated by each method. (Experiments)

Figure 4: Results for ResNet-18 on CIFAR-100 with Adam as the inner optimizer: (Left) accuracy, (Center) loss of the objective function on the validation set, and (Right) the learning rate schedule generated by each method. (Experiments)

Figure 5: Comparison between optimized and online schedules for the remaining three seeds. For each method, we report the schedule generated with the hyper-learning rate (or the step size used to adapt it) that achieves the best final validation accuracy. (Optimized and Online Schedules: Additional Details)

Figure 6: Sensitivity analysis of MARTHE with respect to η0 and µ, with β fixed to 10−7 (Left) and 10−8 (Right). We used VGG-11 on CIFAR-10 with SGDM as the optimizer. Darker colors indicate higher error; white marks where the best performance is obtained. (Sensitivity analysis of MARTHE with respect to η0, µ and β)