[1910.05446v1] On Empirical Comparisons of Optimizers for Deep Learning
Although we made every attempt to conduct realistic experiments, we should only expect our detailed findings to hold for similar workloads under similar protocols, namely uniform quasi-random tuning for tens to hundreds of trials over hypercube search spaces, with our specific learning rate schedule parameterization.
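To make that tuning protocol concrete, here is a minimal sketch of uniform quasi-random search over a hypercube search space. The two-dimensional space, the ranges, and the Halton generator are illustrative assumptions rather than the paper's exact per-workload setup, and `train_and_evaluate` is a hypothetical training routine.

```python
# Minimal sketch of uniform quasi-random tuning over a hypercube search
# space, assuming an illustrative two-dimensional space (learning rate and
# momentum); the per-workload spaces, ranges, and exact quasi-random
# generator in the paper may differ. `train_and_evaluate` is hypothetical.
import numpy as np
from scipy.stats import qmc

n_trials = 100                                   # "tens to hundreds of trials"
sampler = qmc.Halton(d=2, scramble=True, seed=0)
u = sampler.random(n_trials)                     # quasi-random points in [0, 1)^2

# Map the unit hypercube onto the metaparameter ranges (log scale on both axes).
learning_rate = 10.0 ** (-4.0 + 4.0 * u[:, 0])   # log-uniform in [1e-4, 1e0]
momentum = 1.0 - 10.0 ** (-3.0 + 3.0 * u[:, 1])  # 1 - momentum log-uniform in [1e-3, 1e0]

results = []
for lr, m in zip(learning_rate, momentum):
    # results.append(train_and_evaluate(lr, m))  # hypothetical training routine
    pass

# The best trial is then selected by its final validation error.
```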
Selecting an optimizer is a central step in the contemporary deep learning pipeline. In this paper, we demonstrate the sensitivity of optimizer comparisons to the metaparameter tuning protocol. Our findings suggest that the metaparameter search space may be the single most important factor explaining the rankings obtained by recent empirical comparisons in the literature. In fact, we show that these results can be contradicted when metaparameter search spaces are changed. As tuning effort grows without bound, more general optimizers should never underperform the ones they can approximate (i.e., Adam should never perform worse than momentum), but recent attempts to compare optimizers either assume these inclusion relationships are not practically relevant or restrict the metaparameters in ways that break the inclusions. In our experiments, we find that inclusion relationships between optimizers matter in practice and always predict optimizer comparisons. In particular, we find that the popular adaptive gradient methods never underperform momentum or gradient descent. We also report practical tips for tuning the often-ignored metaparameters of adaptive gradient methods and raise concerns about fairly benchmarking optimizers for neural network training.
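To illustrate the inclusion relationship the abstract refers to, the toy check below shows one direction of the argument: an Adam-style update with a very large epsilon behaves like momentum SGD after a learning-rate rescaling. For clarity this sketch omits Adam's bias correction and writes momentum in its exponential-moving-average form; the quadratic objective and all constants are illustrative assumptions, not the paper's construction.

```python
# Toy numerical check of one inclusion relationship: an Adam-style update
# with a very large epsilon reduces, up to a learning-rate rescaling, to
# momentum SGD (written here in exponential-moving-average form). Bias
# correction is omitted for clarity; objective and constants are illustrative.
import numpy as np

def grad(w):
    # Gradient of the simple quadratic objective 0.5 * ||w||^2.
    return w

beta1, beta2 = 0.9, 0.999
eps = 1e8                      # "large epsilon" regime
alpha_adam = 0.1 * eps         # rescale the learning rate to compensate for eps
alpha_sgd = 0.1                # matching effective learning rate

w0 = np.array([1.0, -2.0])
w_adam, w_sgd = w0.copy(), w0.copy()
m, v, b = np.zeros(2), np.zeros(2), np.zeros(2)

for _ in range(100):
    g_a, g_s = grad(w_adam), grad(w_sgd)
    # Adam-style update (no bias correction); eps dominates sqrt(v).
    m = beta1 * m + (1 - beta1) * g_a
    v = beta2 * v + (1 - beta2) * g_a ** 2
    w_adam = w_adam - alpha_adam * m / (np.sqrt(v) + eps)
    # Momentum SGD in EMA form.
    b = beta1 * b + (1 - beta1) * g_s
    w_sgd = w_sgd - alpha_sgd * b

# Prints a value near zero: the two trajectories nearly coincide.
print(np.max(np.abs(w_adam - w_sgd)))
```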
Figure 1: The relative performance of optimizers is consistent with the inclusion relationships, regardless of whether we compare final validation error (top) or test error (bottom). For all workloads, we tuned the metaparameters of each optimizer separately and selected the trial that achieved the lowest final validation error. (Overview of Workloads and Experimental Details)

Figure 2: The relative training speed of optimizers is consistent with the inclusion relationships. We measured (idealized) training speed as the number of training steps required to reach a target validation error (see Table ?? for the error targets). (Overview of Workloads and Experimental Details)

Figure 3: Tuning more metaparameters removes the differences in test error between optimizers observed by Wilson et al. (2017). Tuning a subset of optimizer metaparameters and the initial learning rate is sufficient to equalize performance between all optimizers (left). More extensive metaparameter tuning in our setup, including the learning rate schedule, improves results for all optimizers and still does not produce any differences between optimizer performances (right). (Reconciling disagreements with previous work)

Figure 4: Tuning more metaparameters changes optimizer rankings from Schneider et al. (2019) to rankings that are consistent with the inclusion relationships. The leftmost columns for each workload reproduce the rankings from Schneider et al. (2019), while the remaining columns tune over increasingly general search spaces. All columns use our random search tuning protocol. (Reconciling disagreements with previous work)

Figure 5: Example plot of final validation error projected onto the axes of the metaparameter space. We consider this search space to be appropriate because the optimal values are away from the search space boundaries. (Additional plots)

Figure 6: Validation performance of the best trial mostly converges with as few as 24 metaparameter tuning trials for Transformer on LM1B. Shaded regions indicate 5th and 95th percentiles estimated with bootstrap sampling (see Appendix ??). The search spaces can be found in Appendix ??. (Additional plots)

Figure 7: Validation performance of the best trial mostly converges with as few as 24 metaparameter tuning trials for ResNet-50 on ImageNet. Shaded regions indicate 5th and 95th percentiles estimated with bootstrap sampling (see Appendix ??). The search spaces can be found in Appendix ??. (Additional plots)

Figure 8: Test performance of the best trial mostly converges with as few as 23 metaparameter tuning trials for a 2-layer LSTM on War and Peace. Shaded regions indicate 5th and 95th percentiles estimated with bootstrap sampling (see Appendix ??). The search spaces can be found in Appendix ??. (Additional plots)
Figure 9: The relative performance of optimizers is consistent with the inclusion relationships when we select for lowest training loss. Note that SGD, ADAM, and NADAM for ResNet-50 on ImageNet used label smoothing in their final search spaces (see Section ??), which makes their loss values incommensurate with those of the other optimizers. This is because their final search spaces were optimized to minimize validation error; if we had optimized their search spaces to minimize training error instead, we would not have used label smoothing, and we expect their training loss values would be consistent with the inclusion relationships. (Additional plots)

Figure 10: Our results confirming the relevance of optimizer inclusion relationships do not depend on the exact step budgets or error targets we chose. (Additional plots)
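For curves like those in Figures 6-8, the sketch below shows one way to estimate how the best validation error found changes with the tuning budget, with 5th/95th percentile bands from bootstrap resampling. The per-trial errors are hypothetical stand-ins for real tuning results, and the exact procedure described in the appendix may differ.

```python
# Sketch of best-trial-versus-budget curves with bootstrap percentile bands,
# in the spirit of Figures 6-8. The per-trial errors are hypothetical
# stand-ins; the appendix's exact bootstrap procedure may differ.
import numpy as np

rng = np.random.default_rng(0)
trial_errors = rng.uniform(0.05, 0.15, size=100)  # hypothetical final validation errors

n_bootstrap = 1000
for budget in (1, 2, 4, 8, 16, 24, 32, 64, 100):
    # Resample `budget` trials with replacement; record the best error in each resample.
    resamples = rng.choice(trial_errors, size=(n_bootstrap, budget), replace=True)
    best = resamples.min(axis=1)
    lo, med, hi = np.percentile(best, [5, 50, 95])
    print(f"{budget:3d} trials: best error {med:.4f} (5th-95th pct: {lo:.4f}-{hi:.4f})")
```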