[1910.02629v1] Softmax Is Not an Artificial Trick: An Information-Theoretic View of Softmax in Neural Networks

Despite the great popularity of applying softmax to map the non-normalised outputs of a neural network to a probability distribution over the predicted classes, this normalised exponential transformation still seems artificial. A theoretical framework that incorporates softmax as an intrinsic component is still lacking. In this paper, we view neural networks embedding softmax from an information-theoretic perspective. Under this view, we can naturally and mathematically derive log-softmax as an inherent component of a neural network for evaluating the conditional mutual information between network output vectors and labels given an input datum. We show that training deterministic neural networks by maximising log-softmax is equivalent to enlarging the conditional mutual information, i.e., feeding label information into network outputs. We also generalise our information-theoretic perspective to neural networks with stochasticity and derive information upper and lower bounds of log-softmax. In theory, such an information-theoretic view offers a rationale for embedding softmax in neural networks; in practice, we demonstrate a computer vision application that employs our information-theoretic view to filter out targeted objects in images.
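For orientation, the log-softmax quantity discussed above can be sketched in a few lines of plain Python. This is only the standard textbook definition, computed in a numerically stable way. It is not the paper's information-theoretic derivation, and the `logits` and `label` values are made up for illustration.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]

def softmax(logits):
    """Softmax recovered by exponentiating the log-softmax values."""
    return [math.exp(v) for v in log_softmax(logits)]

# Hypothetical non-normalised network outputs for three classes.
logits = [2.0, 1.0, 0.1]
# Hypothetical index of the true class.
label = 0

probs = softmax(logits)               # a probability distribution over classes
log_prob = log_softmax(logits)[label] # the log-softmax term for the true label
loss = -log_prob                      # maximising log-softmax = minimising this loss

print(probs, log_prob, loss)
```

Maximising `log_prob` for the true label is the same as minimising the usual cross-entropy loss with a one-hot target, which is the training objective the abstract reinterprets in terms of conditional mutual information.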

Figure 1: Feed-Forward
Figure 2: Back-Propagation
Figure 3: Probabilistic graphical models for visualising the information flow of training with log-softmax. (Informative Flow of Training with Log-Softmax)
Figure 4: Original Image
Figure 5: Filtering Out Digit 0
Figure 6: Demonstration of info-masking performance, where the aim is to filter out all digit 0 objects. (Application Example: Information Masking (Info-Masking))