[1912.07559v1] A Deep Neural Network's Loss Surface Contains Every Low-dimensional Pattern
In this paper, we provide a proof of a somewhat surprising property first empirically observed by Skorokhodov and Burtsev (2019): the loss surfaces of deep neural networks contain every low-dimensional pattern. Moreover:
• this property holds for any dataset,
• the pattern locations transfer from the train set to the test set, as well as to other datasets with the same loss and P(y),
• finding such patterns is no harder than regular supervised learning,
• the patterns can be guaranteed to be axis-aligned,
• the patterns can be modified so that their loss values lie within ε of the global minimum of the original problem.

Abstract The work “Loss Landscape Sightseeing with Multi-Point Optimization” (Skorokhodov and Burtsev, 2019) demonstrated that one can empirically find arbitrary 2D binary patterns inside the loss surfaces of popular neural networks. In this paper we prove that: (i) this is a general property of deep universal approximators; and (ii) this property holds for arbitrary smooth patterns, for other dimensionalities, for every dataset, and for any neural network that is sufficiently deep and wide. Our analysis predicts not only the existence of all such low-dimensional patterns, but also two other properties that were observed empirically: (i) that it is easy to find these patterns; and (ii) that they transfer to other datasets (e.g. a test set).
Figure 1. Visualisation of the construction from each theorem. Blue blocks correspond to parts of the network trained to solve the underlying task (mapping from x to y). Green blocks correspond to parts of the network used to approximate the target loss surface. Pink blocks are the losses one minimises. Black blocks represent fixed values that are no longer trained. A white block represents an entity that no longer affects the model. A red dot is a global minimum.

Figure 2. Visualisations of the implicit activation functions σ for the cross-entropy loss (left) and the squared loss (right).

Figure 3. Visualisation of the construction from Theorem 2, on a simple example of 1D regression from φ(x) to −φ(x)² + sin(20φ(x))/5 + 1.2, with the 1D target loss pattern T(h) = 1 − (exp(−(h−0.5)²/0.1) + exp(−(h+0.5)²/0.1)). For simplicity of illustration, we present the target function for this toy example as being based on φ(x), rather than x itself – due to the assumed injectivity, this incurs no loss of generality. We use the quadratic loss ℓ(p, y) = (p−y)². On the right-hand side one sees that, effectively, our construction forces the network to build a distribution over predictions, each slightly shifted, so that after transforming through the loss calculation they correspond to changes in the target loss pattern. There are two minima, h∗ = ±0.5, that are realised in the resulting model. We also provide an empirical result, with an MLP trained with the construction from Theorem 2, that shows both the replication of predictions and the pattern, with the correct placement of minima.
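The claim in Figure 3 that the target loss pattern T(h) has its two minima at h∗ = ±0.5 can be checked numerically. The following is a minimal sketch (the function `T` and the grid-search are our own illustrative code, not from the paper) that evaluates T on a fine grid and locates the minimum on each half-line:

```python
import math

# Target loss pattern from the Figure 3 toy example:
# T(h) = 1 - (exp(-(h - 0.5)^2 / 0.1) + exp(-(h + 0.5)^2 / 0.1))
def T(h):
    return 1.0 - (math.exp(-(h - 0.5) ** 2 / 0.1)
                  + math.exp(-(h + 0.5) ** 2 / 0.1))

# Evaluate on a grid of step 0.001 over [-2, 2] and grid-search
# the minimum on each half-line.
grid = [i / 1000.0 for i in range(-2000, 2001)]
pos = min((h for h in grid if h > 0), key=T)  # minimum on h > 0
neg = min((h for h in grid if h < 0), key=T)  # minimum on h < 0

print(pos, neg)  # both sit at ±0.5 up to the grid resolution
```

Note that the tail of each Gaussian bump shifts the other minimum only by about exp(−10), so to grid precision the minima land exactly at ±0.5, matching the caption.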