Deep Cross-Modal Projection Learning for Image-Text Matching

The key challenge in image-text matching is accurately measuring the similarity between visual and textual inputs. Despite great progress in associating deep cross-modal embeddings with the bi-directional ranking loss, developing strategies for mining useful triplets and selecting appropriate margins remains a challenge in real applications. In this paper, we propose a cross-modal projection matching (CMPM) loss and a cross-modal projection classification (CMPC) loss for learning discriminative image-text embeddings. The CMPM loss minimizes the KL divergence between the projection compatibility distributions and the normalized matching distributions defined over all the positive and negative samples in a mini-batch. The CMPC loss attempts to categorize the vector projection of representations from one modality onto another with an improved norm-softmax loss, further enhancing the feature compactness of each class. Extensive analysis and experiments on multiple datasets demonstrate the superiority of the proposed approach.
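To make the CMPM description concrete, the following is a minimal sketch of a loss of this kind, written from the abstract's wording rather than from the authors' code. Several details are assumptions: the compatibility distribution is taken to be a softmax over scalar projections of one modality's feature onto the L2-normalized features of the other, the matching distribution is taken to be the identity-match indicator normalized per row, a small `eps` guards the logarithm, and the two directions are simply summed. The function names `cmpm_direction` and `cmpm_loss` are hypothetical.

```python
import math


def l2_normalize(vec):
    """Return vec scaled to unit length (assumes vec is non-zero)."""
    norm = math.sqrt(sum(c * c for c in vec))
    return [c / norm for c in vec]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cmpm_direction(src_feats, tgt_feats, labels, eps=1e-8):
    """KL(p || q) averaged over the mini-batch for one direction.

    p_i: softmax over projections of src_feats[i] onto each unit-length target.
    q_i: row-normalized indicator of which targets share the identity of i.
    Assumption: src_feats[i] and tgt_feats[i] form a matched pair, so every
    row has at least one positive.
    """
    n = len(src_feats)
    tgt_unit = [l2_normalize(t) for t in tgt_feats]
    loss = 0.0
    for i in range(n):
        # Scalar projection of sample i onto every normalized target feature.
        proj = [sum(a * b for a, b in zip(src_feats[i], t)) for t in tgt_unit]
        p = softmax(proj)
        match = [1.0 if labels[i] == labels[j] else 0.0 for j in range(n)]
        positives = sum(match)
        q = [m / positives for m in match]
        for p_ij, q_ij in zip(p, q):
            if p_ij > 0.0:  # treat 0 * log(0) as 0
                loss += p_ij * math.log(p_ij / (q_ij + eps))
    return loss / n


def cmpm_loss(image_feats, text_feats, labels, eps=1e-8):
    """Sum of the image-to-text and text-to-image terms (an assumption)."""
    return (cmpm_direction(image_feats, text_feats, labels, eps)
            + cmpm_direction(text_feats, image_feats, labels, eps))
```

Because every sample in the mini-batch contributes to both distributions, a loss built this way needs no triplet mining and no margin hyperparameter, which is the motivation the abstract gives. Calling `cmpm_loss(image_feats, text_feats, labels)` with lists of equal-length feature vectors and one identity label per pair returns a single float; correctly aligned pairs should yield a smaller value than shuffled ones.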

Subject: ECCV.2018 - Accept

Ying_Zhang_Deep_Cross-Modal_Projection@2018@ECCV
