[1910.12037v1] Region Mutual Information Loss for Semantic Segmentation
The idea of RMI is intuitive, and it is also easy to use since it only requires a few additional memory during the training stage

Abstract Semantic segmentation is a fundamental problem in computer vision. It is considered as a pixel-wise classification problem in practice, and most segmentation models use a pixel-wise loss as their optimization criterion. However, the pixelwise loss ignores the dependencies between pixels in an image. Several ways to exploit the relationship between pixels have been investigated, e.g., conditional random fields (CRF) and pixel affinity based methods. Nevertheless, these methods usually require additional model branches, large extra memories, or more inference time. In this paper, we develop a region mutual information (RMI) loss to model the dependencies among pixels more simply and efficiently. In contrast to the pixel-wise loss which treats the pixels as independent samples, RMI uses one pixel and its neighbour pixels to represent this pixel. Then for each pixel in an image, we get a multi-dimensional point that encodes the relationship between pixels, and the image is cast into a multi-dimensional distribution of these high-dimensional points. The prediction and ground truth thus can achieve high order consistency through maximizing the mutual information (MI) between their multi-dimensional distributions. Moreover, as the actual value of the MI is hard to calculate, we derive a lower bound of the MI and maximize the lower bound to maximize the real value of the MI. RMI only requires a few extra computational resources in the training stage, and there is no overhead during testing. Experimental results demonstrate that RMI can achieve substantial and consistent improvements in performance on PASCAL VOC 2012 and CamVid datasets. The code is available at https://github.com/ZJULearning/RMI.
‹Figure 2: An image region and its corresponding multi-dimensional point. Following the same strategy, an image can be cast into a multi-dimensional distribution of many high dimensional points, which encode the relationship between pixels. (Introduction)