[2001.02394v1] Convolutional Networks with Dense Connectivity

Abstract—Recent work has shown that convolutional networks can be substantially deeper, more accurate, and more efficient to train if they contain shorter connections between layers close to the input and those close to the output. In this paper, we embrace this observation and introduce the Dense Convolutional Network (DenseNet), which connects each layer to every other layer in a feed-forward fashion. Whereas traditional convolutional networks with L layers have L connections—one between each layer and its subsequent layer—our network has L(L+1)/2 direct connections. For each layer, the feature-maps of all preceding layers are used as inputs, and its own feature-maps are used as inputs into all subsequent layers. DenseNets have several compelling advantages: they alleviate the vanishing-gradient problem, encourage feature reuse, and substantially improve parameter efficiency. We evaluate our proposed architecture on four highly competitive object recognition benchmark tasks (CIFAR-10, CIFAR-100, SVHN, and ImageNet). DenseNets obtain significant improvements over the state-of-the-art on most of them, while requiring fewer parameters and less computation to achieve high performance.
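The connectivity pattern described above can be sketched in a few lines: each layer receives the concatenation of all preceding feature-maps and contributes its own output to every subsequent layer. Below is a minimal numpy sketch, not the paper's implementation; a random 1x1 channel-mixing matrix followed by ReLU stands in for the actual BN-ReLU-Conv composite function, and `growth_rate` plays the role of the paper's k (the number of feature-maps each layer adds).

```python
import numpy as np

def dense_block(x, num_layers, growth_rate, rng):
    # x: input feature-maps of shape (channels, height, width).
    # Hypothetical stand-in for a DenseNet dense block: each "layer" is a
    # random 1x1 convolution (channel mixing) + ReLU instead of BN-ReLU-Conv.
    features = [x]
    for _ in range(num_layers):
        # Concatenate the feature-maps of ALL preceding layers along channels.
        inp = np.concatenate(features, axis=0)
        w = rng.standard_normal((growth_rate, inp.shape[0]))
        # (growth_rate, C) x (C, H, W) -> (growth_rate, H, W), then ReLU.
        out = np.maximum(np.tensordot(w, inp, axes=1), 0.0)
        # This layer's output is reused as input by every later layer.
        features.append(out)
    return np.concatenate(features, axis=0)

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8, 8))
y = dense_block(x, num_layers=4, growth_rate=12, rng=rng)
# Output channels grow linearly: 16 input + 4 layers * 12 = 64.
```

Because layer i consumes the outputs of layers 0..i-1, a block with L layers contains L(L+1)/2 direct connections, which is exactly the count stated in the abstract.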
‹Figure captions:

Fig. 1: DenseNet layer forward pass: original implementation (left) and efficient implementation (right). Solid boxes correspond to tensors allocated in memory, whereas translucent boxes are pointers. Solid arrows represent computation, and dotted arrows represent memory pointers. The efficient implementation stores the output of the concatenation and pre-activation batch normalization/ReLU operations in temporary storage buffers, whereas the original implementation allocates new memory.

Fig. 2: Left: Comparison of the parameter efficiency on C10+ between DenseNet variations. Middle: Comparison of the parameter efficiency between DenseNet and (pre-activation) ResNets. DenseNet requires about 1/3 of the parameters of ResNet to achieve comparable accuracy. Right: Training and testing curves of the 1001-layer pre-activation ResNet [36] with more than 10M parameters and a 100-layer DenseNet with only 0.8M parameters.

Fig. 3: Comparison of the DenseNet and ResNet top-1 (single-model, single-crop) error rates on the ImageNet classification dataset as a function of learned parameters (left), FLOPs (middle), and GPU memory footprint at training time (right). Training GPU memory measured using the efficient LuaTorch DenseNet implementation with a batch size of 64.

Fig. 4: The average absolute filter weights of convolutional layers in a trained DenseNet. The color of pixel (s, ℓ) encodes the average L1 norm (normalized by the number of input feature-maps) of the weights connecting convolutional layer s to layer ℓ within a dense block. The three columns highlighted by black rectangles correspond to the two transition layers and the classification layer. The first row encodes the weights connected to the input layer of the dense block.

Fig. 5: Left and Middle: GPU memory consumption as a function of network depth/number of parameters. Each model is a DenseNet with k = 12 features added per layer. The efficient implementation can train much deeper models with less memory. Right: Computation time (on an NVIDIA Maxwell Titan X).

Fig. 6: Top-1 validation error on ImageNet as a function of the computational cost (measured in FLOPs) of different DenseNets, with varying growth rate (left), varying width of the bottleneck layers (middle), and varying compression ratio at transition layers (right).

Fig. 7: Comparison of DenseNet and its variant with full dense connectivity, in terms of parameter efficiency (left) and computational efficiency (right).

Fig. 8: Comparison of DenseNet and its variants with partial dense connectivity, in terms of parameter efficiency (left) and computational efficiency (right).

Fig. 9: Comparison of DenseNet and its variant with increasing growth rate, in terms of parameter efficiency (left) and computational efficiency (right).

Fig. 10: Comparison of pre- and post-activation BN in DenseNets, in terms of parameter efficiency (left) and computational efficiency (right).›