M$^2$IV: Towards Efficient and Fine-grained Multimodal In-Context Learning via Representation Engineering

Authors: Yanshu Li, Yi Cao, Hongyang He, Qisen Cheng, Xiang Fu, Xi Xiao, Tianyang Wang, Ruixiang Tang

Multimodal in-context learning (ICL) equips Large Vision-language Models (LVLMs) with the ability to adapt to new tasks via multiple user-provided demonstrations, without requiring any model parameter updates. However, its effectiveness is constrained by the token-intensive nature of multimodal inputs and the complexity of cross-modal few-shot reasoning, which together hinder LVLMs from extracting useful patterns from demonstrations. To address these challenges, we propose M$^2$IV, a novel representation engineering approach that replaces explicit token-level demonstrations with a set of learnable Multimodal In-context Vectors directly injected into the residual streams of LVLMs. By analyzing the distinct roles of multi-head attention (MHA) and multi-layer perceptrons (MLP) in the ICL process, we design a training strategy that enables M$^2$IV to perform fine-grained semantic distillation and robust cross-modal representation learning. M$^2$IV not only improves performance across diverse tasks and LVLMs but also significantly reduces token overhead, enabling graceful scaling to many-shot scenarios. To further enhance usability, we introduce VLibrary, a repository that stores trained M$^2$IVs for flexible retrieval and injection. With VLibrary, users can steer pre-trained LVLMs in a customized manner that meets diverse requirements. Extensive experiments demonstrate that M$^2$IV consistently outperforms vanilla ICL and prior representation engineering baselines, achieving an average accuracy gain of 3.74% with substantial improvements in overall efficiency.
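The idea of injecting vectors into the residual stream, rather than prepending demonstration tokens, can be illustrated with a toy sketch. This is not the authors' implementation: the function `layer_forward`, the per-branch vectors `v_mha` and `v_mlp`, and the scaling factors `alpha_mha` and `alpha_mlp` are hypothetical names chosen for illustration, and the MHA and MLP sub-layers are stand-in callables. It assumes one vector is added to each branch's output before that output joins the residual stream.

```python
from typing import Callable, List, Optional

Vector = List[float]
SubLayer = Callable[[Vector], Vector]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def scale(a: Vector, s: float) -> Vector:
    """Multiply every element of a vector by a scalar."""
    return [s * x for x in a]


def layer_forward(
    h: Vector,
    mha: SubLayer,
    mlp: SubLayer,
    v_mha: Optional[Vector] = None,   # hypothetical in-context vector for the MHA branch
    v_mlp: Optional[Vector] = None,   # hypothetical in-context vector for the MLP branch
    alpha_mha: float = 1.0,           # hypothetical scaling factor (assumption)
    alpha_mlp: float = 1.0,           # hypothetical scaling factor (assumption)
) -> Vector:
    """One toy transformer layer with optional vector injection.

    Each branch output is shifted by its scaled vector before being
    added back to the residual stream h.
    """
    a = mha(h)
    if v_mha is not None:
        a = add(a, scale(v_mha, alpha_mha))
    h = add(h, a)                     # residual connection after attention

    m = mlp(h)
    if v_mlp is not None:
        m = add(m, scale(v_mlp, alpha_mlp))
    return add(h, m)                  # residual connection after MLP


# Stand-in sub-layers; a real LVLM would use attention and a feed-forward net.
toy_mha: SubLayer = lambda h: scale(h, 0.5)
toy_mlp: SubLayer = lambda h: scale(h, 0.1)

plain = layer_forward([1.0, 2.0], toy_mha, toy_mlp)
steered = layer_forward(
    [1.0, 2.0], toy_mha, toy_mlp,
    v_mha=[1.0, 0.0], v_mlp=[0.0, 1.0],
    alpha_mha=2.0, alpha_mlp=0.5,
)
```

In this sketch the prompt length is unchanged by steering: the only extra state is one vector and one scalar per branch per layer, which is the sense in which such an approach could avoid the token cost of explicit demonstrations.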

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Published: 2025-04-06 22:02:21 UTC

2504.04633
