[1912.02292] Deep Double Descent: Where Bigger Models and More Data Hurt
We provide extensive evidence for our hypothesis in modern deep learning settings, and show that it is robust to choices of dataset, architecture, and training procedure.

We show that a variety of modern deep learning tasks exhibit a “double-descent” phenomenon where, as we increase model size, performance first gets worse and then gets better. Moreover, we show that double descent occurs not just as a function of model size, but also as a function of the number of training epochs. We unify the above phenomena by defining a new complexity measure we call the effective model complexity and conjecture a generalized double descent with respect to this measure. Furthermore, our notion of model complexity allows us to identify certain regimes where increasing (even quadrupling) the number of train samples actually hurts test performance.
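Many of the experiments below train on CIFAR-10/100 with a fraction p of labels corrupted (e.g. 15% or 20% label noise). As a hedged sketch of that corruption step — the function name is ours, and replacing a flipped label with a uniformly random *different* class is our reading of "label noise" here:

```python
import numpy as np

def add_label_noise(labels, p, num_classes, rng):
    """Flip each label, independently with probability p, to a uniformly
    random *different* class (so every flipped label is incorrect)."""
    labels = np.asarray(labels).copy()
    flip = rng.random(labels.shape[0]) < p
    # an offset in {1, ..., num_classes-1} never maps a class to itself
    offsets = rng.integers(1, num_classes, size=labels.shape[0])
    labels[flip] = (labels[flip] + offsets[flip]) % num_classes
    return labels

rng = np.random.default_rng(0)
clean = np.arange(50000) % 10            # stand-in for CIFAR-10 train labels
noisy = add_label_noise(clean, 0.15, 10, rng)
```

Test error in the figures below is always measured against the original (clean) test labels; only the train set is corrupted.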
Figure 3: Left: Test error as a function of model size and train epochs. The horizontal line corresponds to model-wise double descent: varying model size while training for as long as possible. The vertical line corresponds to epoch-wise double descent, with test error undergoing double descent as train time increases. Right: Train error of the corresponding models. All models are ResNet18s trained on CIFAR-10 with 15% label noise, data augmentation, and Adam for up to 4K epochs. (Introduction)
Figure 4: Test loss (per-token perplexity) as a function of Transformer model size (embedding dimension d_model) on language translation (IWSLT’14 German-to-English). The curve for 18k samples is generally lower than the one for 4k samples, but is also shifted to the right, since fitting 18k samples requires a larger model. Thus, for some models, performance with 18k samples is worse than with 4k samples. (Introduction)
Figure 5: Model-wise double descent for ResNet18s trained on CIFAR-100 and CIFAR-10 with varying label noise. Optimized using Adam with learning rate 0.0001 for 4K epochs, with data augmentation. (Model-wise Double Descent)
Figure 6: Without data augmentation. Figure 7: With data augmentation. Figure 8: Effect of data augmentation: 5-layer CNNs on CIFAR-10, with and without data augmentation. Data augmentation shifts the interpolation threshold to the right, shifting the test-error peak accordingly. Optimized using SGD for 500K steps. See Figure ?? for larger models. (Model-wise Double Descent)
Figure 9: Noiseless setting. 5-layer CNNs on CIFAR-100 with no label noise; note the peak in test error. Trained with SGD and no data augmentation. See Figure ?? for the early-stopping behavior of these models. (Model-wise Double Descent)
Figure 10: Left: Training dynamics for models in three regimes. Models are ResNet18s on CIFAR-10 with 20% label noise, trained using Adam with learning rate 0.0001 and data augmentation. Right: Test error over (model size × epochs). Three slices of this plot are shown on the left. (Epoch-wise Double Descent)
Figure 11: ResNet18 on CIFAR-10. Figure 12: ResNet18 on CIFAR-100. Figure 13: 5-layer CNN on CIFAR-10. Figure 14: Epoch-wise double descent for ResNet18 and CNN (width 128). ResNets trained using Adam with learning rate 0.0001; CNNs trained with SGD with an inverse-square-root learning rate. (Epoch-wise Double Descent)
Figure 15: Model-wise double descent for 5-layer CNNs on CIFAR-10, for varying dataset sizes. Top: there is a range of model sizes (shaded green) where training on 2× more samples does not improve test error. Bottom: there is a range of model sizes (shaded red) where training on 4× more samples does not improve test error. (Sample-wise Non-monotonicity)
Figure 16: Sample-wise non-monotonicity. Test loss (per-word perplexity) as a function of the number of train samples, for two Transformer models trained to completion on IWSLT’14. For both model sizes, there is a regime where more samples hurt performance. Compare to Figure ??, of model-wise double descent in the identical setting. (Sample-wise Non-monotonicity)
Figure 17: Sample-wise non-monotonicity. (Sample-wise Non-monotonicity)
Figure 18: Left: Test error as a function of model size and number of train samples, for 5-layer CNNs on CIFAR-10 with 20% label noise. Note that the ridge of high test error again lies along the interpolation threshold. Right: Three slices of the left plot, showing the effect of more data for models of different sizes. Note that, when training to completion, more data helps for small and large models but does not help for near-critically-parameterized models (green). (Sample-wise Non-monotonicity)
Figure 19: 5-layer CNNs. Figure 20: ResNet18s. Figure 21: Transformers. Figure 22: Scaling of model size with our parameterization of width and embedding dimension. (Models)
Figure 23: Random Fourier Features on the Fashion-MNIST dataset. The setting is equivalent to a two-layer neural network with e^{-ix} activation, whose randomly initialized first layer is fixed throughout training. The second layer is trained using gradient flow. (Random Features: A Case Study)
Figure 24: Sample-wise double-descent slice for Random Fourier Features on the Fashion-MNIST dataset. In this figure the embedding dimension (number of random features) is 1000. (Random Features: A Case Study)
Figure 25: Constant learning rate. Figure 26: Inverse-square-root learning rate. Figure 27: Dynamic learning rate. Figure 28: Epoch-wise double descent for ResNet18 trained with Adam and multiple learning rate schedules. (Epoch-wise Double Descent: Additional results)
Figure 29: Constant learning rate. Figure 30: Inverse-square-root learning rate. Figure 31: Dynamic learning rate. Figure 32: Epoch-wise double descent for ResNet18 trained with SGD and multiple learning rate schedules. (Epoch-wise Double Descent: Additional results)
Figure 33: Constant learning rate. Figure 34: Inverse-square-root learning rate. Figure 35: Dynamic learning rate. Figure 36: Epoch-wise double descent for ResNet18 trained with SGD+Momentum and multiple learning rate schedules. (Epoch-wise Double Descent: Additional results)
Figure 37: Top: Train and test performance as a function of both model size and train epochs. Bottom: Test error dynamics of the same model (ResNet18 on CIFAR-100 with no label noise, data augmentation, and the Adam optimizer, trained for 4K epochs with learning rate 0.0001). Note that even with optimal early stopping this setting exhibits double descent. (Clean Settings With Model-wise Double Descent)
Figure 38: Top: Train and test performance as a function of both model size and train epochs. Bottom: Test error dynamics of the same models. 5-layer CNNs on CIFAR-100 with no label noise and no data augmentation, trained with SGD for 1e6 steps. Same experiment as Figure ??.
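The random-features case study above lends itself to a small self-contained reproduction of the interpolation threshold. The sketch below is our own: it uses synthetic data rather than Fashion-MNIST, real-valued cos features as a stand-in for the paper's complex e^{-ix} features, and the closed-form minimum-norm least-squares fit in place of gradient flow (which converges to that same solution from zero initialization). All names and constants are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

n_train, n_test, dim = 100, 1000, 5
X_tr = rng.normal(size=(n_train, dim))
X_te = rng.normal(size=(n_test, dim))
w_star = rng.normal(size=dim)
y_tr = X_tr @ w_star + 0.2 * rng.normal(size=n_train)  # noisy train labels
y_te = X_te @ w_star                                   # clean test labels

def features(X, W, b):
    # real-valued random Fourier features (cos in place of e^{-ix})
    return np.cos(X @ W + b)

train_mse, test_mse = {}, {}
for d in [20, 50, 100, 200, 1000]:        # d = number of random features
    W = rng.normal(size=(dim, d))
    b = rng.uniform(0.0, 2.0 * np.pi, size=d)
    Phi_tr, Phi_te = features(X_tr, W, b), features(X_te, W, b)
    # minimum-norm least squares: the limit of gradient flow from zero init
    theta, *_ = np.linalg.lstsq(Phi_tr, y_tr, rcond=None)
    train_mse[d] = float(np.mean((Phi_tr @ theta - y_tr) ** 2))
    test_mse[d] = float(np.mean((Phi_te @ theta - y_te) ** 2))
```

Train error reaches ~0 once d ≥ n_train (the interpolation threshold); in this setting the test error typically spikes near d = n_train and descends again for larger d, mirroring the model-wise double descent in Figures 23–24.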
(Clean Settings With Model-wise Double Descent)
Figure 39: Left: Test error dynamics with weight decay of 5e-4 (bottom left) and without weight decay (top left). Right: Test and train error and test loss for models with varying amounts of weight decay. All models are 5-layer CNNs on CIFAR-10 with 10% label noise, trained with data augmentation and SGD for 500K steps. (Weight Decay)
Figure 40: Test error. Figure 41: Train loss. Figure 42: Generalized double descent for weight decay. We found that using the same initial learning rate for all weight decay values led to training instabilities, which resulted in some noise in the test-error (weight decay × epochs) plot shown above. (Weight Decay)
Figure 43: Model-wise test error dynamics for a subsampled IWSLT’14 dataset. Left: 4k samples; right: 18k samples. Note that with optimal early stopping, more samples is always better. (Early Stopping does not exhibit double descent)
Figure 44: Model-wise test error dynamics for the IWSLT’14 de-en and subsampled WMT’14 en-fr datasets. Left: IWSLT’14; right: subsampled (200k samples) WMT’14. Note that with optimal early stopping, the test error is much lower for this task. (Early Stopping does not exhibit double descent)
Figure 45: Top: Train and test performance as a function of both model size and train epochs. Bottom: Test error dynamics of the same model (CNN on CIFAR-10 with 10% label noise, data augmentation, and SGD with learning rate ∝ 1/√T). (Early Stopping does not exhibit double descent)
Figure 46. (Training Procedure)
Figure 47: Effect of ensembling (ResNets, 15% label noise). Test error of an ensemble of 5 models, compared to the base models. The ensembled classifier is determined by plurality vote over the 5 base models. Note that ensembling helps most around the critical regime. All models are ResNet18s trained on CIFAR-10 with 15% label noise, using Adam for 4K epochs (same setting as Figure ??). Test error is measured against the original (not noisy) test set, and each model in the ensemble is trained on a train set with independently sampled 15% label noise. (Ensembling)
Figure 48: Effect of ensembling (CNNs, no label noise). Test error of an ensemble of 5 models, compared to the base models. All models are 5-layer CNNs trained on CIFAR-10 with no label noise, using SGD and no data augmentation (same setting as Figure ??). (Ensembling)
Figure 49: 5-layer CNNs on CIFAR, with data augmentation. (Ensembling)
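The ensembles in Figures 47–48 combine 5 base models by plurality vote over their hard class predictions. A minimal sketch of that vote (the function name and tie-breaking toward the lowest class id are our choices; the source does not specify a tie-break rule):

```python
import numpy as np

def plurality_vote(preds):
    """preds: (n_models, n_samples) array of integer class predictions.
    Returns, per sample, the class predicted by the most base models
    (ties broken toward the lowest class id by argmax)."""
    n_models, n_samples = preds.shape
    n_classes = int(preds.max()) + 1
    votes = np.zeros((n_samples, n_classes), dtype=int)
    for m in range(n_models):                       # tally each model's vote
        votes[np.arange(n_samples), preds[m]] += 1
    return votes.argmax(axis=1)

# five hypothetical base models predicting on three inputs
preds = np.array([[0, 1, 2],
                  [0, 1, 1],
                  [1, 2, 1],
                  [0, 0, 1],
                  [2, 1, 1]])
ensembled = plurality_vote(preds)   # -> array([0, 1, 1])
```

The vote operates on hard predictions; averaging softmax outputs before the argmax is a common alternative ensembling rule not used here.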