DocMMIR: A Framework for Document Multi-modal Information Retrieval

#1 DocMMIR: A Framework for Document Multi-modal Information Retrieval [PDF] [Copy] [Kimi] [REL]

Authors: Zirui Li, Siwei Wu, Yizhi Li, Xingyu Wang, Yi Zhou, Chenghua Lin

The rapid advancement of unsupervised representation learning and large-scale pre-trained vision-language models has significantly improved cross-modal retrieval tasks. However, existing multi-modal information retrieval (MMIR) studies lack a comprehensive exploration of document-level retrieval and suffer from the absence of cross-domain datasets at this granularity. To address this limitation, we introduce DocMMIR, a novel multi-modal document retrieval framework designed explicitly to unify diverse document formats and domains—including Wikipedia articles, scientific papers (arXiv), and presentation slides—within a comprehensive retrieval scenario. We construct a large-scale cross-domain multimodal dataset, comprising 450K training, 19.2K validation, and 19.2K test documents, serving as both a benchmark to reveal the shortcomings of existing MMIR models and a training set for further improvement. The dataset systematically integrates textual and visual information. Our comprehensive experimental analysis reveals substantial limitations in current state-of-the-art MLLMs (CLIP, BLIP2, SigLIP-2, ALIGN) when applied to our tasks, with only CLIP (ViT-L/14) demonstrating reasonable zero-shot performance. Through systematic investigation of cross-modal fusion strategies and loss function selection on the CLIP (ViT-L/14) model, we develop an optimised approach that achieves a +31% improvement in MRR@10 metrics from zero-shot baseline to fine-tuned model. Our findings offer crucial insights and practical guidance for future development in unified multimodal document retrieval tasks.

Subject: EMNLP.2025 - Findings

2025.findings-emnlp.705@ACL

#1 DocMMIR: A Framework for Document Multi-modal Information Retrieval [PDF] [Copy] [Kimi] [REL]