Image-text matching is a crucial task that bridges the visual and linguistic modalities. Recent research typically formulates it as the problem of maximizing the margin against the hardest negatives, which improves learning efficiency and avoids poor local optima. We argue that this formulation carries a serious limitation: conventional training confines its horizon to the hardest negative examples, while the remaining negatives offer a range of semantic differences that the hardest negatives alone do not capture. In this paper, we propose an efficient negative-distribution-guided training framework for image-text matching to unlock the substantial room for improvement left by this limitation. Rather than simply incorporating additional negative examples into the training objective, which could dilute both the leading role of the hardest negatives and the effect of large-margin learning in producing a robust matching model, our central idea is to supply the objective with distributional information about the entire set of negative examples. Specifically, we first construct a sample similarity matrix from several pretrained models to extract the distributional information of the full set of negative samples. We then encode this information into a margin regularization module that smooths the similarity differences across all negatives, facilitating the capture of fine-grained semantic differences while the main learning process is still guided by maximizing the margin with hard negative examples. Furthermore, we propose a hardest negative rectification module that addresses the instability of hardest-negative selection based on predicted similarity and corrects erroneously selected hardest negatives. We evaluate our method in combination with several state-of-the-art image-text matching methods, and quantitative and qualitative experiments demonstrate its strong generalizability and effectiveness.
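To make the interplay of the two ingredients described above more concrete, the following PyTorch sketch shows one plausible way to combine a hardest-negative triplet objective with a distribution-guided margin regularizer. The function name, the use of a precomputed reference similarity matrix (here called ref_sim, standing in for the similarities derived from pretrained models), and the specific weighting are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def distribution_guided_triplet_loss(sim, ref_sim, base_margin=0.2, reg_weight=0.1):
    """Hinge-based triplet loss with hardest-negative mining, plus a margin
    regularization term shaped by a reference similarity matrix (a sketch).

    sim:     (B, B) predicted image-text similarity matrix; diagonal = positive pairs.
    ref_sim: (B, B) similarity matrix from pretrained models, used only to set
             per-negative target margins (no gradient flows through it).
    """
    B = sim.size(0)
    pos = sim.diag().view(B, 1)                                  # positive-pair scores
    mask = torch.eye(B, dtype=torch.bool, device=sim.device)     # positions of positives

    # --- hardest-negative term (standard max-of-hinges formulation) ---
    cost_i2t = (base_margin + sim - pos).clamp(min=0).masked_fill(mask, 0)
    cost_t2i = (base_margin + sim - pos.t()).clamp(min=0).masked_fill(mask, 0)
    hardest = cost_i2t.max(dim=1)[0].mean() + cost_t2i.max(dim=0)[0].mean()

    # --- distribution-guided margin regularization (assumed form) ---
    # Negatives that the reference models rate as more similar to the anchor
    # receive a smaller target margin, so the whole negative set provides
    # graded supervision instead of a single hardest example.
    with torch.no_grad():
        target_margin = base_margin * (1.0 - ref_sim).masked_fill(mask, 0)
    reg = (target_margin + sim - pos).clamp(min=0).masked_fill(mask, 0).mean()

    return hardest + reg_weight * reg
```

In this sketch the hardest-negative term keeps its leading role, while the regularizer only nudges all negatives toward margins proportional to how dissimilar the pretrained reference models judge them to be.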