Roth_Context-Aware_Multimodal_Pretraining@CVPR2025@CVF

Total: 1

#1 Context-Aware Multimodal Pretraining

Authors: Karsten Roth, Zeynep Akata, Dima Damen, Ivana Balazevic, Olivier J. Henaff

Large-scale multimodal representation learning successfully optimizes for zero-shot transfer at test time. Yet the standard pretraining paradigm (contrastive learning on large amounts of image-text data) does not explicitly encourage representations to support few-shot adaptation. In this work, we propose a simple, but carefully designed extension to multimodal pretraining which enables representations to accommodate additional context. Using this objective, we show that vision-language models can be trained to exhibit significantly increased few-shot adaptation: across 21 downstream tasks, we find up to four-fold improvements in test-time sample efficiency, and average few-shot adaptation gains of over 5%, while retaining zero-shot generalization performance across model scales and training durations. In particular, equipped with simple, training-free, metric-based adaptation mechanisms, our representations surpass significantly more complex optimization-based adaptation schemes.
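
To make the "training-free, metric-based adaptation" mentioned above concrete, here is a minimal sketch of one common mechanism of that kind: a nearest-prototype classifier over embeddings from a frozen encoder. This is an illustrative assumption, not the paper's actual adaptation rule or code; the function names (`prototype_classifier`, `l2_normalize`) and the NumPy implementation are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def prototype_classifier(support_feats, support_labels, query_feats, num_classes):
    """Training-free few-shot adaptation (illustrative sketch, not the paper's method):
    classify query embeddings by cosine similarity to per-class mean ("prototype")
    embeddings computed from a small labeled support set.

    support_feats:  (n_support, d) image embeddings from a frozen encoder
    support_labels: (n_support,)   integer class labels for the support set
    query_feats:    (n_query, d)   image embeddings to classify
    """
    support_labels = np.asarray(support_labels)
    support_feats = l2_normalize(np.asarray(support_feats))
    query_feats = l2_normalize(np.asarray(query_feats))

    # Per-class mean of the support embeddings, renormalized to the unit sphere.
    prototypes = np.stack([
        l2_normalize(support_feats[support_labels == c].mean(axis=0))
        for c in range(num_classes)
    ])

    # Cosine similarity between each query and each prototype; argmax gives the prediction.
    sims = query_feats @ prototypes.T  # shape (n_query, num_classes)
    return sims.argmax(axis=1)

# Toy usage with random features standing in for frozen vision-language embeddings.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    support = rng.normal(size=(10, 64))          # 5 classes x 2 shots
    labels = np.repeat(np.arange(5), 2)
    queries = rng.normal(size=(8, 64))
    print(prototype_classifier(support, labels, queries, num_classes=5))
```

Because this kind of adaptation involves no gradient updates, its cost is a single forward pass over the few-shot examples plus a matrix multiply at inference, which is the sense in which the abstract contrasts it with "more complex optimization-based adaptation schemes".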

Subject: CVPR.2025 - Highlight