[1910.06962v1] SegSort: Segmentation by Discriminative Sorting of Segments
We demonstrate that the proposed approach consistently improves over conventional pixel-wise prediction approaches for supervised semantic segmentation.
Abstract
Almost all existing deep learning approaches for semantic segmentation tackle this task as a pixel-wise classification problem. Yet humans understand a scene not in terms of pixels, but by decomposing it into perceptual groups and structures that are the basic building blocks of recognition. This motivates us to propose an end-to-end pixel-wise metric learning approach that mimics this process. In our approach, the optimal visual representation determines the right segmentation within individual images and associates segments of the same semantic class across images. The core visual learning problem is therefore to maximize the similarity within segments and minimize the similarity between segments. Given a model trained this way, inference is performed consistently by extracting pixel-wise embeddings and clustering them into segments, with the semantic label of each segment determined by a majority vote of its nearest neighbors in an annotated set. The resulting method, SegSort, is a first attempt at deep-learning-based unsupervised semantic segmentation, achieving 76% of the performance of its supervised counterpart. When supervision is available, SegSort shows consistent improvements over conventional approaches based on pixel-wise softmax training, producing more precise boundaries and more consistent region predictions. SegSort further yields interpretable results, as each label choice can be readily understood from the retrieved nearest segments.
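To make the inference procedure concrete, below is a minimal sketch in Python, assuming unit-normalized pixel embeddings, per-image segment indices, and a bank of labeled training-segment prototypes are already available. The function names (`segment_prototypes`, `knn_majority_vote`) and the default k are illustrative, not taken from the paper's released code.

```python
import numpy as np
from collections import Counter

def segment_prototypes(embeddings, segment_ids):
    """Average (then renormalize) the pixel embeddings of each segment."""
    protos, ids = [], []
    for sid in np.unique(segment_ids):
        mean = embeddings[segment_ids == sid].mean(axis=0)
        protos.append(mean / (np.linalg.norm(mean) + 1e-12))
        ids.append(sid)
    return np.stack(protos), np.array(ids)

def knn_majority_vote(query_protos, bank_protos, bank_labels, k=5):
    """Label each query segment by the majority label of its k nearest
    training prototypes under cosine similarity (embeddings are unit-norm)."""
    sims = query_protos @ bank_protos.T             # cosine similarities
    topk = np.argsort(-sims, axis=1)[:, :k]         # indices of k nearest prototypes
    return np.array([Counter(bank_labels[idx]).most_common(1)[0][0]
                     for idx in topk])
```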
Figure 1. Top: Our proposed approach partitions an image in the embedding space into aligned segments (framed in red) and assigns each of them the majority label of its retrieved segments (framed in green or pink). Bottom: Our approach presents the first deep-learning-based unsupervised semantic segmentation (right). In the supervised setting (middle), it produces more consistent region predictions and more precise boundaries than its parametric counterpart (left). (Introduction)

Figure 2. The overall training diagram of our proposed framework, Segment Sorting (SegSort), with vMF clustering [4]. Given a batch of images (leftmost), we compute pixel-wise embeddings (middle left) from a metric learning segmentation network. We then segment each image with vMF clustering (middle right), dubbed pixel sorting. We train the network via maximum likelihood estimation derived from a mixture of vMF distributions, dubbed segment sorting. In between, we also illustrate how pixel-wise features on a hypersphere are processed for pixel and segment sorting. A segment (rightmost) is framed with its corresponding vMF clustering color if it appears in the displayed images; unframed segments from different images are associated in the embedding space. Inference follows the same procedure but uses k-nearest neighbor search to associate segments in the training set. (Related Works)

Figure 3. During supervised training, we partition the proposed segments (left) according to the ground truth mask (middle). The resulting segments (right) are thus aligned with the ground truth mask, and each aligned segment is labeled (0 or 1) accordingly. Note that the purple and yellow segments become, respectively, a false positive and a false negative that help regularize the predicted boundaries. (Segment Sorting)

Figure 4. Visual comparison on the PASCAL VOC 2012 validation set, with examples from DeepLabv3+ (upper two rows) and PSPNet (lower two rows). We observe prominent improvements on thin structures, such as human legs and chair legs, as well as more consistent region predictions where context is critical, such as the wheels of motorcycles and the big trunks of buses. (Experimental Setup)
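The segment sorting step illustrated in Figure 2 amounts to fitting a mixture of vMF distributions on the hypersphere. Below is a minimal sketch of such a loss, assuming a shared, fixed concentration kappa and one prototype (mean direction) per segment; the form shown, a softmax over kappa-scaled cosine similarities to all prototypes in the batch, is a simplified reading rather than the authors' exact implementation, and the names are illustrative.

```python
import numpy as np

def vmf_segment_loss(embeddings, segment_ids, prototypes, proto_ids, kappa=10.0):
    """Negative log-likelihood of assigning each pixel to its own segment.

    embeddings : (N, D) L2-normalized pixel embeddings
    segment_ids: (N,)   segment index per pixel
    prototypes : (S, D) L2-normalized mean direction per segment
    proto_ids  : (S,)   segment index per prototype
    """
    logits = kappa * embeddings @ prototypes.T                 # (N, S) scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)                # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    own = segment_ids[:, None] == proto_ids[None, :]           # one matching prototype per pixel
    return -log_probs[own].mean()
```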
Figure 5. Two examples, one correct and one incorrect prediction, of segment retrieval for supervised semantic segmentation on the VOC 2012 validation set. Query segments (leftmost) are framed with their clustering color. (Top) The query segments of the rider, the horse, and the horse outline retrieve semantically relevant segments from the training set. (Bottom) In the failure case, the retrieved segments suggest that the number tag on the front of the bike is confused with the number tags or front lights on motorbikes, resulting in false predictions. (Experimental Setup)

Figure 6. Training data for unsupervised semantic segmentation. We produce fine segmentations (right), HED-owt-ucm, from the contours (middle) detected by HED [73], following the procedure of gPb-owt-ucm [49]. (Fully Supervised Semantic Segmentation)
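The pipeline in Figure 6 converts detected contours into the fine segments used for unsupervised training. The snippet below is only a toy stand-in under a strong simplification: instead of the actual gPb-owt-ucm procedure applied to HED contours, it thresholds the contour probabilities and labels connected components, which conveys the idea but not the method.

```python
import numpy as np
from scipy import ndimage

def contours_to_segments(contour_prob, threshold=0.3):
    """contour_prob: (H, W) contour probabilities in [0, 1]."""
    interior = contour_prob < threshold                  # non-boundary pixels
    segments, _ = ndimage.label(interior)                # connected components of the interior
    # Assign boundary pixels the label of the nearest interior pixel so every pixel is covered.
    nearest = ndimage.distance_transform_edt(segments == 0, return_indices=True)[1]
    return segments[tuple(nearest)]
```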
Figure 7. Segment retrieval results for unsupervised semantic segmentation on the VOC 2012 validation set. Query segments (leftmost) are framed with their clustering color. The embeddings learned by unsupervised SegSort attend more to visual than to semantic similarity compared to the supervised setting: hair, faces, blue shirts, and wheels are retrieved successfully. The last query segment fails because the texture around the knee is more similar to animal skin. (Fully Supervised Semantic Segmentation)

Figure 8. We perform a nearest-neighbor-based hierarchical agglomerative clustering, FINCH [59], on foreground segment prototypes to discover visual groups. The top two rows show random samples from two clusters at the finest level. The bottom table displays clusters at a coarser level of 16 clusters, with four representative segments shown per cluster. (Unsupervised Semantic Segmentation)

Figure 9. t-SNE visualization of prototype embeddings from supervised SegSort, framed with category colors. Best viewed with zoom-in. (t-SNE Embedding Visualization)

Figure 10. t-SNE visualization of prototype embeddings from unsupervised SegSort, framed with category colors. Best viewed with zoom-in. (t-SNE Embedding Visualization)

Figure 11. We show how the number of clusters affects the segmentation performance. The highest performance is reached at 25 clusters, slightly more than the number of categories in the dataset. (Ablation Study)

Figure 12. We study how the embedding dimension affects the segmentation performance. As long as the embedding dimension is sufficient, i.e., larger than 8, the performance does not change drastically. (Ablation Study)

Figure 13. We study how the number of nearest neighbors used during inference affects the segmentation performance. The performance is robust to this choice, as the mIoU varies by only 0.4%. (Ablation Study)

Figure 14. Visual comparison on the Cityscapes validation set. Large objects, such as 'bus' and 'truck', are improved thanks to more consistent region predictions, while small objects, such as 'pole' and 'tlight', are also better captured. (DeepLabv3+ / ResNet-101 Results on VOC)
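The grouping in Figure 8 relies on FINCH, whose core operation links every sample to its single nearest neighbor and merges the resulting connected components. Below is a minimal sketch of that first partition step on unit-normalized segment prototypes; the full algorithm recurses this step to build a hierarchy, and the helper name here is ours, not from the FINCH or SegSort code.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def first_neighbor_partition(prototypes):
    """prototypes: (S, D) L2-normalized embeddings; returns a cluster id per row."""
    sims = prototypes @ prototypes.T
    np.fill_diagonal(sims, -np.inf)                  # exclude self-matches
    nn = np.argmax(sims, axis=1)                     # first nearest neighbor of each prototype
    rows = np.arange(len(prototypes))
    adj = csr_matrix((np.ones(len(rows)), (rows, nn)),
                     shape=(len(rows), len(rows)))   # sparse first-neighbor graph
    _, labels = connected_components(adj, directed=False)
    return labels
```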