Caffagni_Recurrence-Enhanced_Vision-and-Language_Transformers_for_Robust_Multimodal_Document_Retrieval@CVPR2025@CVF

Total: 1

#1 Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval [PDF] [Copy] [Kimi] [REL]

Authors: Davide Caffagni, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

Cross-modal retrieval is gaining increasing efficacy and interest from the research community, thanks to large-scale training, novel architectural and learning designs, and its application in LLMs and multimodal LLMs. In this paper, we move a step forward and design an approach that allows for multimodal queries -- composed of both an image and a text -- and can search within collections of multimodal documents, where images and text are interleaved. Our model, ReT, employs multi-level representations extracted from different layers of both visual and textual backbones, both at the query and document side. To allow for multi-level and cross-modal understanding and feature extraction, ReT employs a novel Transformer-based recurrent cell that integrates both textual and visual features at different layers, and leverages sigmoidal gates inspired by the classical design of LSTMs. Extensive experiments on M2KR and M-BEIR benchmarks show that Ret achieves state-of-the-art performance across diverse settings.

Subject: CVPR.2025 - Poster