2025.findings-emnlp.556@ACL

Total: 1

#1 Both Text and Images Leaked! A Systematic Analysis of Data Contamination in Multimodal LLM [PDF] [Copy] [Kimi] [REL]

Authors: Dingjie Song, Sicheng Lai, Mingxuan Wang, Shunian Chen, Lichao Sun, Benyou Wang

The rapid advancement of multimodal large language models (MLLMs) has significantly enhanced performance across benchmarks. However, data contamination — partial/entire benchmark data is included in the model’s training set — poses critical challenges for fair evaluation. Existing detection methods for unimodal large language models (LLMs) are inadequate for MLLMs due to multimodal data complexity and multi-phase training. We systematically analyze multimodal data contamination using our analytical framework, MM-DETECT, which defines two contamination categories — unimodal and cross-modal — and effectively quantifies contamination severity across multiple-choice and caption-based Visual Question Answering tasks. Evaluations on twelve MLLMs and five benchmarks reveal significant contamination, particularly in proprietary models and older benchmarks. Crucially, contamination sometimes originates during unimodal pre-training rather than solely from multimodal fine-tuning. Our insights refine contamination understanding, guiding evaluation practices and improving multimodal model reliability.

Subject: EMNLP.2025 - Findings