[1911.13299] What's Hidden in a Randomly Weighted Neural Network?
We hope that our findings serve as a useful step in the pursuit of understanding the optimization of neural networks.

Abstract

Training a neural network is synonymous with learning the values of the weights. In contrast, we demonstrate that randomly weighted neural networks contain subnetworks which achieve impressive performance without ever training the weight values. Hidden in a randomly weighted Wide ResNet-50 [28] we show that there is a subnetwork (with random weights) that is smaller than, but matches the performance of, a ResNet-34 [8] trained on ImageNet [3]. Not only do these “untrained subnetworks” exist, but we provide an algorithm to effectively find them. We empirically show that as randomly weighted neural networks with fixed weights grow wider and deeper, an “untrained subnetwork” approaches a network with learned weights in accuracy.
Figure 1. If a neural network with random weights (center) is sufficiently overparameterized, it will contain a subnetwork (right) that performs as well as a trained neural network (left) with the same number of parameters. (Introduction)

Figure 2. In the edge-popup algorithm, we associate a score with each edge. On the forward pass we choose the top edges by score. On the backward pass we update the scores of all the edges with the straight-through estimator, allowing helpful edges that are “dead” to re-enter the subnetwork. We never update the value of any weight in the network, only the score associated with each weight. (Introduction)

Figure 3. Going Deeper: Experimenting with shallow to deep neural networks on CIFAR-10 [12]. As the network becomes deeper, we are able to find subnetworks at initialization that perform as well as the dense original network when trained. The baselines are drawn as a horizontal line as we are not varying the % of weights. When we write Weights ∼ D we mean that the weights are randomly drawn from distribution D and are never tuned. Instead we find subnetworks of size (% of Weights)/100 * (Total # of Weights). (The edge-popup Algorithm and Analysis)

Figure 4. Going Wider: Varying the width (i.e. number of channels) of Conv4 and Conv6 for CIFAR-10 [12]. When Conv6 is wide enough, a subnetwork of the randomly weighted model (with %Weights = 50) performs just as well as the full model when it is trained. (Experimental Setup)

Figure 5. Varying the width of Conv4 on CIFAR-10 [12] while modifying k so that the # of Parameters is fixed along each curve. c1, c2, c3 are constants which coincide with the # of Parameters for k = [30, 50, 70] at width multiplier 1. (Varying the Width)

Figure 6. Comparing the performance of edge-popup with the algorithm presented by Zhou et al. [29] on CIFAR-10 [12]. (Varying the Width)
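The caption of Figure 2 summarizes the core mechanics of edge-popup: rank edges by a learned score, keep only the top fraction on the forward pass, and let gradients flow to every score (including masked-out edges) via the straight-through estimator, while the weights themselves stay frozen. A minimal pure-Python sketch of one linear layer under this scheme may clarify the two passes; the function names, shapes, and score layout are our own illustration, not the authors' released code:

```python
def edge_popup_forward(x, weights, scores, k_frac):
    """Forward pass under edge-popup for a single linear layer.

    Only the top k_frac of edges, ranked by score, participate;
    the (frozen) weight values are never modified. `weights` and
    `scores` are both [in][out] nested lists (illustrative layout).
    """
    flat = sorted(s for row in scores for s in row)
    # Threshold chosen so roughly k_frac of edges survive.
    cutoff = flat[int(len(flat) * (1 - k_frac))]
    mask = [[1.0 if s >= cutoff else 0.0 for s in row] for row in scores]
    out = [sum(x[i] * weights[i][j] * mask[i][j] for i in range(len(x)))
           for j in range(len(weights[0]))]
    return out, mask

def edge_popup_score_grad(x, grad_out, weights):
    """Backward pass for the scores via the straight-through estimator.

    The hard top-k mask is treated as the identity, so every score
    receives a gradient x_i * w_ij * dL/dout_j -- even scores of
    currently "dead" edges, which lets them re-enter the subnetwork.
    """
    return [[x[i] * weights[i][j] * grad_out[j]
             for j in range(len(grad_out))]
            for i in range(len(x))]
```

After each such backward step, the scores are updated by gradient descent and the next forward pass re-selects the top-k edges, so the subnetwork can change from step to step even though no weight ever does.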

Figure 7. Testing different weight distributions on CIFAR-10 [12]. (Comparing with Zhou et al. [29])

Figure 8. Testing our algorithm on ImageNet [3]. We use a fixed k = 30%, and find subnetworks within a randomly weighted ResNet-50 [8], Wide ResNet-50 [28], and ResNet-101. Notably, a randomly weighted Wide ResNet-50 contains a subnetwork which is smaller than, but matches the performance of, ResNet-34. Note that for the non-dense models, # of Parameters denotes the size of the subnetwork. (Effect of The Distribution)

Figure 9. Examining the effect of % weights on ImageNet for edge-popup and the method of Zhou et al. (ImageNet [3] Experiments)

Figure 10. Examining the effect of using the “Scaled” initialization detailed in Section ?? on ImageNet. (ImageNet [3] Experiments)