Bridging Vision and Language Concepts through Optimal Transport Semantic Flow

Authors: Chenyang Zhang, Anqi Dong, Guangming Zhu, Nuoye Xiong, Siyuan Wang, Lin Mei, Liang Zhang

Concept Bottleneck Models (CBMs) promise transparent reasoning by predicting through human-interpretable concepts, yet their effectiveness fundamentally depends on how well visual and textual representations are aligned. Existing vision-language CBMs often rely on pre-aligned encoders or global cosine similarity, which obscures fine-grained concept localization and fails to reflect the true semantic geometry. In this work, we rethink concept alignment as a dynamic cross-modal transport process rather than a static projection and propose the Optimal Transport Flow Concept Bottleneck Model (OTF-CBM). OTF-CBM first learns a data-driven semantic cost via Inverse Optimal Transport to measure cross-modal distances, and then performs unbalanced optimal-transport-based flow matching to model semantic transitions between visual patches and textual concepts. With velocity-based concept activation, OTF-CBM captures interpretable geometric relations without ODE integration. Experiments show that OTF-CBM achieves superior classification accuracy and concept faithfulness, offering a new geometric and dynamical perspective for interpretable cross-modal reasoning.
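
To make the transport view concrete, the following is a minimal NumPy sketch of unbalanced entropic optimal transport between patch and concept embeddings, followed by the straight-line displacement commonly used as a flow-matching target. It is an illustration under stated assumptions, not the OTF-CBM implementation: the cost is a fixed cosine distance rather than a cost learned via Inverse Optimal Transport, the embeddings are random stand-ins, and every function name and hyperparameter (`eps`, `rho`, `n_iters`) is hypothetical.

```python
import numpy as np


def cosine_cost(X, Y):
    """Pairwise cosine distance (1 - cosine similarity) between rows of X and Y."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return 1.0 - Xn @ Yn.T


def unbalanced_sinkhorn(C, a, b, eps=0.1, rho=1.0, n_iters=200):
    """Entropic OT with KL-relaxed marginals (unbalanced Sinkhorn scaling).

    C: (n, m) cost matrix; a: (n,) source weights; b: (m,) target weights.
    eps is the entropic regularization, rho the marginal relaxation strength.
    Returns the (n, m) transport plan.
    """
    K = np.exp(-C / eps)           # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    power = rho / (rho + eps)      # exponent < 1 softens the marginal constraints
    for _ in range(n_iters):
        u = (a / (K @ v)) ** power
        v = (b / (K.T @ u)) ** power
    return u[:, None] * K * v[None, :]


def straight_line_velocities(X, Y, P):
    """Displacement from each source point to its barycentric target under P.

    For the linear interpolant x_t = (1 - t) * x_0 + t * x_1 the velocity is
    x_1 - x_0, so it can be read off directly without integrating an ODE.
    """
    row_mass = P.sum(axis=1, keepdims=True)
    targets = (P @ Y) / row_mass   # plan-weighted average concept per patch
    return targets - X


# Random stand-ins for patch and concept embeddings (assumption: shared space).
rng = np.random.default_rng(0)
patches = rng.normal(size=(6, 8))    # 6 visual patches, 8-dim
concepts = rng.normal(size=(3, 8))   # 3 textual concepts, 8-dim

C = cosine_cost(patches, concepts)
a = np.full(6, 1.0 / 6)
b = np.full(3, 1.0 / 3)

P = unbalanced_sinkhorn(C, a, b)
V = straight_line_velocities(patches, concepts, P)
concept_mass = P.sum(axis=0)         # mass each concept receives from the patches
```

Here `concept_mass` gives one simple per-concept score and the rows of `P` indicate which patches contribute to each concept. How OTF-CBM actually derives concept activations from velocities is not specified in the abstract, so this scoring choice is purely illustrative.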

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Published: 2026-06-25 11:24:44 UTC

2606.26891