2025.findings-emnlp.953@ACL

MAFMO: Multi-modal Adaptive Fusion with Meta-template Optimization for Vision-Language Models

Authors: Mingrui Xie, Lulu Xu, Junliang Du

Vision-language models such as CLIP demonstrate exceptional generalization capabilities but face significant adaptation challenges due to their parameter scale, prompt sensitivity, and cross-modal alignment difficulties. Existing approaches primarily focus on single-modality adjustments, leading to suboptimal alignment and limited generalization. We introduce MAFMO, a plug-and-play framework comprising: (1) a Harmonic Cross-Modal Adapter enabling efficient cross-modal knowledge transfer; (2) a Meta-Template Optimization module that dynamically generates input-dependent templates; and (3) a Cross-Modal Knowledge Synthesis mechanism that preserves critical structural relationships during adaptation. Extensive experiments across multiple fine-grained visual recognition benchmarks demonstrate that MAFMO consistently improves existing methods’ performance in both novel-class accuracy and harmonic mean, while maintaining robustness under various challenging conditions with minimal computational overhead.

Subject: EMNLP.2025 - Findings