[1912.13200v2] AdderNet: Do We Really Need Multiplications in Deep Learning?
Experiments conducted on benchmark datasets show that AdderNets can well approximate the performance of CNNs with the same architectures, which could have a significant impact on future hardware design.

Abstract Compared with cheap addition operations, multiplication operations are of much higher computational complexity. The widely-used convolutions in deep neural networks are exactly cross-correlations that measure the similarity between input features and convolution filters, which involves massive multiplications between floating-point values. In this paper, we present adder networks (AdderNets) to trade these massive multiplications in deep neural networks, especially convolutional neural networks (CNNs), for much cheaper additions to reduce computation costs. In AdderNets, we take the ℓ1-norm distance between filters and input features as the output response. The influence of this new similarity measure on the optimization of neural networks has been thoroughly analyzed. To achieve better performance, we develop a special back-propagation approach for AdderNets by investigating the full-precision gradient. We then propose an adaptive learning rate strategy to enhance the training procedure of AdderNets according to the magnitude of each neuron's gradient. As a result, the proposed AdderNets can achieve 74.9% Top-1 accuracy and 91.7% Top-5 accuracy using ResNet-50 on the ImageNet dataset without any multiplications in the convolution layers.
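The core substitution can be illustrated for a single output unit: a conventional convolution sums element-wise products of an input patch and a filter, while the adder unit takes the negative ℓ1-distance between them. A minimal numpy sketch (function names and patch shapes are illustrative, not from the paper):

```python
import numpy as np

def conv_unit(X_patch, F):
    # Conventional convolution output: cross-correlation,
    # i.e. a sum of element-wise multiplications.
    return np.sum(X_patch * F)

def adder_unit(X_patch, F):
    # AdderNet output: negative l1-distance between patch and filter,
    # computed with subtractions, absolute values, and additions only.
    return -np.sum(np.abs(X_patch - F))

rng = np.random.default_rng(0)
X_patch = rng.standard_normal((3, 3, 4))  # d x d x c_in input patch
F = rng.standard_normal((3, 3, 4))        # matching filter

y_conv = conv_unit(X_patch, F)
y_add = adder_unit(X_patch, F)
```

Note that the adder response is always non-positive and reaches its maximum, zero, only when the patch matches the filter exactly, which is why it can act as a similarity measure.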
Figure 1. Visualization of features in AdderNets and CNNs. Features of CNNs in different classes are divided by their angles. In contrast, features of AdderNets tend to be clustered towards different class centers, since AdderNets use the ℓ1-norm to distinguish different classes. The visualization results suggest that the ℓ1-distance can serve as a similarity measure between the filter and the input feature in deep neural networks. (Introduction)

Figure 2. Visualization of filters in the first layer of LeNet-5-BN on the MNIST dataset. Both of them can extract useful features for image classification. (Experiments on CIFAR)

∂Y(m, n, t) / ∂F(i, j, k, t) = X(m + i, n + j, k)
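For the adder output Y(m, n, t) = −Σ |X − F|, the exact partial derivative with respect to the filter is only a sign function, which carries little magnitude information; the "full-precision gradient" mentioned in the abstract instead back-propagates the difference X − F. A hedged numpy sketch of the two choices for a single output unit (the function name is mine, not the paper's):

```python
import numpy as np

def adder_grads(X_patch, F):
    """Gradients of Y = -sum(|X - F|) with respect to the filter F.

    sign_grad is the exact (sub)gradient of the l1 output; full_grad is
    the full-precision surrogate (X - F) that the paper proposes to use
    during back-propagation instead.
    """
    diff = X_patch - F
    sign_grad = np.sign(diff)  # exact subgradient: values in {-1, 0, +1}
    full_grad = diff           # full-precision surrogate gradient
    return sign_grad, full_grad
```

The sign gradient discards how far each weight is from the input, so different weights receive updates of identical magnitude; the full-precision surrogate restores that magnitude information, which is what makes the improved training possible.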

Figure 3. Learning curves of AdderNets using different optimization schemes. FP and Sgn gradient denote the full-precision and sign gradients, respectively. The proposed adaptive learning rate scaling with the full-precision gradient achieves the highest accuracy (99.40%) with the smallest loss. (Visualization Results)

Figure 4. Histograms over the weights of an AdderNet (left) and a CNN (right). The weights of AdderNets follow a Laplace distribution while those of CNNs follow a Gaussian distribution. (Visualization Results)
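The adaptive learning rate strategy mentioned in the abstract can be sketched as a per-layer scaling that normalizes gradient magnitudes across adder layers, so layers with unusually small or large gradients still receive comparably sized updates. A minimal sketch under that assumption (the function name, the hyper-parameter name `eta`, and the epsilon guard are mine):

```python
import numpy as np

def adaptive_lr(grad, eta=0.1):
    # Per-layer scaling factor: proportional to sqrt(k), the square root
    # of the number of elements in the layer, and inversely proportional
    # to the l2-norm of that layer's gradient, so the scaled update has
    # a comparable magnitude in every adder layer.
    k = grad.size
    return eta * np.sqrt(k) / (np.linalg.norm(grad) + 1e-12)

# Usage: F <- F - global_lr * adaptive_lr(grad) * grad
```

The design intuition is that adder layers naturally produce gradients of very different magnitudes from one layer to another, so a single global learning rate under- or over-shoots some layers; normalizing by each layer's gradient norm compensates for this.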