[1910.11127] Reversible designs for extreme memory cost reduction of CNN training

Abstract: Training Convolutional Neural Networks (CNNs) is a resource-intensive task
that requires specialized hardware for efficient computation. One of the most
limiting bottlenecks of CNN training is the memory cost of storing the
activation values of hidden layers, which are needed to compute the weight
gradients during the backward pass of the backpropagation algorithm.
Recently, reversible architectures have been proposed to reduce the memory cost
of training large CNNs by reconstructing the input activation values of hidden
layers from their output during the backward pass, circumventing the need to
accumulate these activations in memory during the forward pass. In this paper,
we push this idea to the extreme and analyze reversible network designs
yielding minimal training memory footprint. We investigate the propagation of
numerical errors in long chains of invertible operations and analyze their
effect on training. We introduce the notion of pixel-wise memory cost to
characterize the memory footprint of model training, and propose a new model
architecture able to efficiently train arbitrarily deep neural networks with a
minimum memory cost of 352 bytes per input pixel. This new kind of architecture
enables training large neural networks on very limited memory, opening the door
for neural network training on embedded devices or non-specialized hardware.
For instance, we demonstrate training of our model to 93.3% accuracy on the
CIFAR10 dataset within 67 minutes on a low-end Nvidia GTX750 GPU with only 1GB
of memory.
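The core mechanism the abstract relies on, reconstructing a layer's input from its output so that activations need not be stored, can be illustrated with the additive coupling used in reversible blocks (as in RevNet-style architectures). The sketch below is illustrative only: `F` and `G` stand in for arbitrary residual modules (e.g. convolution-normalization-activation stacks) and are hypothetical placeholders, not the paper's actual layers.

```python
import numpy as np

def F(x):
    # Placeholder residual function; can be any (non-invertible) module.
    return 0.5 * np.tanh(x)

def G(x):
    # Second placeholder residual function.
    return 0.5 * np.sin(x)

def reversible_forward(x1, x2):
    # Additive coupling: each half of the activations is updated
    # using only the other half, which makes the block invertible.
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def reversible_inverse(y1, y2):
    # The inputs are recomputed exactly from the outputs, so nothing
    # needs to be stored in memory during the forward pass.
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

x1, x2 = np.random.randn(8), np.random.randn(8)
y1, y2 = reversible_forward(x1, x2)
r1, r2 = reversible_inverse(y1, y2)
print(np.allclose(x1, r1) and np.allclose(x2, r2))  # True (up to float rounding)
```

In exact arithmetic the reconstruction is perfect; in floating point, small errors appear and accumulate along long chains of such inversions, which is precisely the error-propagation effect the paper analyzes.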

Figure 1: Illustration of the ResNet-18 architecture and its memory requirements. Modules contributing to the peak memory consumption are shown in red; these modules contribute to the memory cost by storing their input in memory. The green annotation represents the extra cost of storing the gradient in memory. The peak memory consumption occurs during the backward pass through the last convolution, so this layer is annotated with an additional gradient memory cost. At this step of the computation, all lower parameterized layers have stored their input in memory, which constitutes the memory bottleneck.

Figure 2: Illustration of the backpropagation process through a reversible block. In the forward pass (left), activations are propagated forward from top to bottom. They are not kept in live memory, as they are to be recomputed in the backward pass, so no memory bottleneck occurs. The backward pass has two phases: first, the hidden and input activations are recomputed from the output through an additional forward pass through both modules (middle); once the activations are recomputed, the activation gradients are propagated backward through both modules of the reversible block (right). Because the activation and gradient computations flow in opposite directions through the two modules, the two computations cannot be efficiently overlapped, which results in the local memory bottleneck of storing all hidden activations within the reversible block before the gradient backpropagation step.

Figure 3: Illustration of the Revnet architecture and its memory consumption. Modules contributing to the peak memory consumption are shown in red. The peak memory consumption occurs during the backward pass through the first reversible block; at this step of the computation, all hidden activations within the reversible block are stored in memory simultaneously.

Figure 4: Illustration of the i-Revnet architecture and its memory consumption. The peak memory consumption occurs during the backward pass through the top reversible block. In addition to this local memory bottleneck, the cost of storing the top layers' weights (in orange) becomes a new memory bottleneck, as the weight kernel size grows quadratically in the number of channels.

Figure 5: Illustration of the numerical errors arising from batch normalization layers. Comparison of the theoretical and empirical evolution of the α ratio for different ρ values in our toy example. Empirical values were computed for a Gaussian input signal with zero mean and unit standard deviation, and a white Gaussian noise of standard deviation 10⁻⁵.

Figure 6: Illustration of the numerical errors arising from invertible activation layers. Comparison of the theoretical and empirical evolution of the α ratio for different negative slopes n. Empirical values were computed for a Gaussian input signal with zero mean and unit standard deviation, and a white Gaussian noise of standard deviation 10⁻⁵.

Figure 7: Illustration of a layer-wise invertible architecture and its memory consumption.

Figure 8: Illustration of a hybrid architecture and its peak memory consumption.

Figure 9: Illustration of the backpropagation process through a reversible block of our proposed hybrid architecture. In the forward pass (left), activations are propagated forward from top to bottom. They are not kept in live memory, as they are to be recomputed in the backward pass, so no memory bottleneck occurs. The backward pass has two phases. First, the input activations are recomputed from the output using the reversible block's analytical inverse (middle); this step reconstructs the input activations with minimal reconstruction error, and hidden activations are not kept in live memory, avoiding the local memory bottleneck of the reversible block. Once the input activations are recomputed, the gradients are propagated backward through both modules of the reversible block (right); during this second phase, hidden activations are recomputed backward through each module using the layer-wise inverse operations, yielding a minimal memory footprint.

Figure 10: Evolution of the SNR through the layers of a layer-wise invertible model. Colored boxes show the spans of two consecutive convolutional blocks (convolution-normalization-activation layers). The SNR is continuously degraded throughout each block of the network, resulting in numerical instabilities.

Figure 11: Illustration of the impact of depth (in number of layers N) and negative slope n on the numerical errors. Both panels show the evolution of the SNR at the lowest layer of a layer-wise invertible network with increasing depth and negative slope; the lower the SNR, the larger the numerical errors of the inverse reconstructions. (Left) The SNR decreases exponentially with depth until it reaches a value of 1; at this point, the noise is of the same scale as the signal, and no learning can happen. These results were computed with a negative slope of n = 2. (Right) Evolution of the SNR with different negative slopes n for a layer-wise reversible model of depth 3. On a log-log scale, the relationship between negative slope and SNR is almost linear. Strikingly, at a depth of only three layers, a negative slope of n = 10⁻³ already drives the SNR down to 1; with such a parameterization, even the shallowest models are incapable of learning.

Figure 12: Impact of the numerical errors on the accuracy of layer-wise invertible models. (Left) Evolution of a 6-layer model's accuracy, with and without inverse reconstructions, as a function of the negative slope. Without reconstruction, accuracy benefits from smaller negative slopes. With inverse reconstructions, accuracy similarly benefits as n decreases from 1 to 0.1; for smaller negative slopes, however, accuracy drops sharply due to numerical errors. (Right) Evolution of accuracy with depth for a negative slope n = 0.2, with and without inverse reconstructions. Without reconstruction, accuracy benefits from depth. With inverse reconstructions, accuracy similarly benefits as the number of layers grows from 3 to 7; for N > 7, however, accuracy drops sharply due to numerical errors.

Figure 13: Evolution of the SNR through the layers of a hybrid architecture model. The spans of two consecutive reversible blocks are shown with colored boxes. Within reversible blocks, the SNR quickly degrades due to the numerical errors introduced by invertible layers. However, the signal propagated to the input of each reversible block is recomputed using the reversible block inverse, which is much more stable. Hence, the SNR declines sharply within each reversible block but rises almost back to its original level at the input of each block.

Figure 14: Illustration of the impact of depth (in number of layers N) and negative slope n on the numerical errors. Both panels show the evolution of the SNR at the lowest layer of our hybrid architecture with increasing depth and negative slope. Our hybrid architecture greatly reduces the impact of both depth and negative slope on the numerical errors.

Figure 15: Impact of the numerical errors on the accuracy of layer-wise invertible models. Our proposed hybrid architecture greatly stabilizes the numerical errors, resulting in smaller effects of depth and negative slope on accuracy.
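The dependence of reconstruction noise on the negative slope n described in Figures 6, 11, and 12 can be demonstrated with a small numerical experiment. The sketch below is a simplified model, not the paper's experimental setup: a leaky-ReLU-style invertible activation is applied, white noise of standard deviation 10⁻⁵ stands in for accumulated floating-point error, and the inverse is applied to the noisy output. Because the inverse divides the negative part by n, the noise on negative activations is amplified by a factor 1/n, so the reconstruction SNR falls as n shrinks.

```python
import numpy as np

def leaky_relu(x, n):
    # Invertible activation: identity on the positive part,
    # slope n on the negative part.
    return np.where(x > 0, x, n * x)

def leaky_relu_inv(y, n):
    # Inverse divides the negative part by n, amplifying any noise
    # carried by negative activations by a factor 1/n.
    return np.where(y > 0, y, y / n)

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)          # zero-mean, unit-std input signal
noise = 1e-5 * rng.standard_normal(100_000)  # stand-in for float error

def snr_after_inverse(n):
    y = leaky_relu(x, n) + noise          # forward pass plus perturbation
    x_rec = leaky_relu_inv(y, n)          # reconstruct the input
    err = x_rec - x
    return np.mean(x**2) / np.mean(err**2)

for n in (1.0, 0.5, 0.1, 0.01):
    print(f"n = {n:5.2f}  SNR = {snr_after_inverse(n):.3e}")
```

The printed SNR decreases monotonically as n decreases, matching the qualitative trend of the figures: smaller negative slopes make the activation closer to a plain (non-invertible) ReLU and benefit accuracy, but make its inverse increasingly ill-conditioned.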