[1911.05932v1] GIFT: Learning Transformation-Invariant Dense Visual Descriptors via Group CNNs
We report state-of-the-art performance on the task of correspondence estimation on the HPSequences dataset, the SUN3D dataset, and several new datasets with extreme scale and orientation changes.
Abstract: Finding local correspondences between images taken from different viewpoints requires local descriptors that are robust against geometric transformations. One approach to transformation invariance is to integrate out the transformations by pooling features extracted from transformed versions of an image. However, feature pooling may sacrifice the distinctiveness of the resulting descriptors. In this paper, we introduce a novel visual descriptor named Group Invariant Feature Transform (GIFT), which is both discriminative and robust to geometric transformations. The key idea is that the features extracted from transformed versions of an image can be viewed as a function defined on the group of transformations. Instead of pooling the features, we use group convolutions to exploit the underlying structure of the extracted features on the group, resulting in descriptors that are both discriminative and provably invariant to the group of transformations. Extensive experiments show that GIFT outperforms state-of-the-art methods on several benchmark datasets and practically improves the performance of relative pose estimation.
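To make the group-feature idea concrete, below is a minimal PyTorch sketch, not the authors' implementation: the scale set, module names, and tensor shapes are all illustrative assumptions. Features from rescaled copies of an image are stacked along an extra group axis, and a 1D convolution along that axis stands in for the group convolution. Since rescaling the input image shifts (permutes) features along this axis and convolution commutes with shifts, the output remains equivariant to the group action.

```python
# Minimal sketch (not the authors' code) of group features and group
# convolution over the scale group. All names and sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def group_features(image, backbone, scales=(0.5, 0.71, 1.0, 1.41, 2.0)):
    """Run a shared fully-convolutional backbone on rescaled copies of the
    image and resample the feature maps to a common resolution, producing a
    tensor with an extra group axis: (B, |G|, C, H, W)."""
    feats = []
    for s in scales:
        warped = F.interpolate(image, scale_factor=s, mode="bilinear",
                               align_corners=False)
        f = backbone(warped)
        f = F.interpolate(f, size=image.shape[-2:], mode="bilinear",
                          align_corners=False)
        feats.append(f)
    return torch.stack(feats, dim=1)

class GroupConv1d(nn.Module):
    """Convolution along the group axis. For a cyclic group (rotations),
    circular padding would be the strictly correct choice; zero padding is
    used here for simplicity."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, k, padding=k // 2)

    def forward(self, g_feats):          # (B, |G|, C) per interest point
        x = g_feats.transpose(1, 2)      # (B, C, |G|): convolve over group axis
        return self.conv(x).transpose(1, 2)
```

Pooling over the group axis after such group convolutions (rather than before any learning, as in plain feature pooling) yields descriptors that are invariant to the group action while retaining the structure learned by the convolutions.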
Figure 1: Pipeline. The input image is warped with different transformations and fed into a vanilla CNN to extract group features. The group features for each interest point are then further processed by two group CNNs and a bilinear pooling operator to obtain the final GIFT descriptors.
Figure 2: Scaling and rotating an image (left) results in a permutation of the feature maps defined on the scale-rotation group (right). The red arrows illustrate the directions of the permutation.
Figure 3: Visualization of estimated correspondences on HPSequences (first two rows), ER-HP (middle two rows), and ES-HP (last two rows). The first two columns use keypoints detected by SuperPoint and the last two columns use keypoints detected by DoG.
Figure 4: Visualization of estimated dense correspondences. Matched points are drawn in the same color in the reference and query images. Only correctly estimated correspondences are shown.
Figure 5: PCKs on the HPatches dataset as scaling and rotation increase. GIFT-SP uses SuperPoint as the detector, while GIFT-DoG uses DoG as the detector.
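The Figure 1 pipeline can be expressed in a few lines by reusing group_features and GroupConv1d from the sketch above. This is again a hypothetical sketch: gift_descriptors, gconv_a, and gconv_b are illustrative names, and the bilinear pooling uses the common outer-product-then-sum formulation, which may differ in detail from the variant used in the paper.

```python
# Hypothetical end-to-end sketch of the Figure 1 pipeline (assumed names).
import torch
import torch.nn.functional as F

def gift_descriptors(image, keypoints, backbone, gconv_a, gconv_b):
    """image: (1, 3, H, W); keypoints: (N, 2) pixel coords (x, y).
    gconv_a / gconv_b: GroupConv1d branches with in_ch equal to the
    backbone's channel count."""
    g = group_features(image, backbone)               # (1, |G|, C, H, W)
    B, G_, C, H, W = g.shape
    # Sample the group feature maps at each keypoint location.
    norm = keypoints.clone().float()
    norm[:, 0] = norm[:, 0] / (W - 1) * 2 - 1         # x -> [-1, 1]
    norm[:, 1] = norm[:, 1] / (H - 1) * 2 - 1         # y -> [-1, 1]
    grid = norm.view(1, -1, 1, 2)                     # (1, N, 1, 2)
    maps = g.view(1, G_ * C, H, W)
    per_kp = F.grid_sample(maps, grid, align_corners=True)  # (1, G*C, N, 1)
    per_kp = per_kp.view(G_, C, -1).permute(2, 0, 1)  # (N, |G|, C)
    # Two group-CNN branches, then bilinear pooling over the group axis.
    a = gconv_a(per_kp)                               # (N, |G|, Ca)
    b = gconv_b(per_kp)                               # (N, |G|, Cb)
    desc = torch.einsum("ngc,ngd->ncd", a, b)         # sum over group axis
    return F.normalize(desc.flatten(1), dim=1)        # L2-normalized descriptors
```

Summing the outer product over the group axis discards the (already permuted) group coordinate, which is what makes the final descriptor invariant to the scale and rotation of the input, while the two learned group-CNN branches keep it discriminative.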