[1910.11119] Designovel's system description for Fashion-IQ challenge 2019
As future work, we will focus on positive example mining, since each candidate can have multiple matched targets, while the given training and validation datasets indicate only a single target, which may lead to overfitting.
Abstract

This paper describes Designovel’s systems submitted to the Fashion IQ Challenge 2019. The goal of the challenge is to build an image retrieval system in which the input query is a candidate image with two text phrases describing users’ feedback about visual differences between the candidate image and the search target. We built the systems by combining methods from recent work on deep metric learning, multi-modal retrieval and natural language processing. First, we encode both candidate and target images with CNNs into high-level representations, and encode the text descriptions into a single text vector using a Transformer-based encoder. Then we compose the candidate image vector and the text representation into a single vector that is expected to be biased toward the target image vector. Finally, we compute cosine similarities between the composed vector and the encoded vectors of the whole dataset, and rank them in descending order to obtain a ranked list. We experimented on the Fashion IQ 2019 dataset with various hyperparameters, achieving 39.12% average recall with a single model and 43.67% average recall with an ensemble of 16 models on the test dataset.
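The retrieval step described above (compose candidate and text vectors, then rank the gallery by cosine similarity) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the actual composition module is not specified in the abstract, so a simple normalized sum stands in for it, and the encoders are assumed to have already produced the vectors.

```python
import numpy as np

def compose(candidate_vec, text_vec):
    # Hypothetical composition: the paper only states that the candidate
    # image vector and text vector are composed into one vector biased
    # toward the target; a normalized sum is used here as a placeholder.
    v = candidate_vec + text_vec
    return v / np.linalg.norm(v)

def rank_gallery(query_vec, gallery):
    # Cosine similarity between the composed query and every encoded
    # gallery image, ranked in descending order of similarity.
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = g @ (query_vec / np.linalg.norm(query_vec))
    return np.argsort(-sims)

# Toy 2-D example: the composed query [1, 1] is closest to gallery row 0.
query = compose(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
ranking = rank_gallery(query, np.array([[1.0, 1.0], [1.0, 0.0], [0.0, -1.0]]))
```

In practice the gallery vectors would come from the same CNN encoder used for the candidate images, so that cosine similarity in the shared embedding space is meaningful.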
Figure 1. Overview of Designovel’s system.