Wang_Mamba-3VL_Taming_State_Space_Model_for_3D_Vision_Language_Learning@ICCV2025@CVF

Total: 1

#1 Mamba-3VL: Taming State Space Model for 3D Vision Language Learning

Authors: Yuan Wang, Yuxin Chen, Zhongang Qi, Lijun Liu, Jile Jiao, Xuetao Feng, Yujia Liang, Ying Shan, Zhipeng Zhang

3D vision-language (3D-VL) reasoning, which connects natural language with the 3D physical world, represents a milestone in advancing spatial intelligence. While transformer-based methods dominate 3D-VL research, their quadratic complexity and simplistic positional embedding mechanisms severely limit effective modeling of long-range dependencies and spatial relationships in 3D-VL tasks. State Space Models (SSMs) have emerged as promising linear-complexity alternatives for sequential data processing, and their inherent selection mechanism offers notable capability for spatial modeling. Despite this potential, straightforward adoption of Mamba for 3D-VL tasks encounters two obstacles: (1) how to perceive the positions of 3D objects and understand complex spatial relationships, and (2) how to achieve thorough synergy of multi-modal features. In this paper, we propose Mamba-3VL, a pioneering 3D-VL framework that models complex intra- and inter-modality correlations and enhances spatial relation reasoning, while guaranteeing top-tier performance, high efficiency, and generalization potential for 3D-VL tasks. Specifically, the Mamba Mixer explicitly models 3D-VL interaction via channel twisting and a relation-prioritized spatial scanning policy, maximally retaining the spatial relations of object-centric features. To further provide precise spatial encoding for Mamba, we develop the Instance-aware Dynamic Position Adapter (IDPA), which dynamically adjusts instance-specific positional embeddings and enhances local spatial relations among 3D objects. Extensive results validate that Mamba-3VL outperforms competitors on seven 3D-VL benchmarks and showcases versatile potential for challenging embodied AI tasks.
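The abstract does not give implementation details for IDPA, so the following is only a minimal PyTorch-style sketch of what an instance-aware dynamic positional adapter could look like, assuming per-object box geometry (centers and sizes) and FiLM-style modulation by object features. Every name and parameter below (InstanceAwareDynamicPositionAdapter, geo_dim, hidden_dim, etc.) is hypothetical and not taken from the paper.

```python
import torch
import torch.nn as nn


class InstanceAwareDynamicPositionAdapter(nn.Module):
    """Hypothetical sketch of an IDPA-like module (not the paper's design):
    encode per-object 3D box geometry into a positional code, then modulate
    it with the object's own features so the embedding is instance-specific."""

    def __init__(self, feat_dim: int, geo_dim: int = 6, hidden_dim: int = 128):
        super().__init__()
        # Encode raw geometry (assumed: box center xyz + size whd) into feat_dim.
        self.geo_mlp = nn.Sequential(
            nn.Linear(geo_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, feat_dim),
        )
        # Predict per-instance scale/shift from object features, so the
        # positional code is "dynamically adjusted" per instance.
        self.mod_mlp = nn.Linear(feat_dim, 2 * feat_dim)

    def forward(self, obj_feats: torch.Tensor, obj_geo: torch.Tensor) -> torch.Tensor:
        # obj_feats: (B, N, feat_dim) object-centric features
        # obj_geo:   (B, N, geo_dim)  box centers and sizes
        pos = self.geo_mlp(obj_geo)                       # (B, N, feat_dim)
        scale, shift = self.mod_mlp(obj_feats).chunk(2, dim=-1)
        pos = pos * (1 + scale) + shift                   # instance-specific adjustment
        return obj_feats + pos                            # inject positional code


if __name__ == "__main__":
    idpa = InstanceAwareDynamicPositionAdapter(feat_dim=256)
    feats = torch.randn(2, 32, 256)   # 2 scenes, 32 object proposals each
    geo = torch.randn(2, 32, 6)       # box centers + sizes per proposal
    print(idpa(feats, geo).shape)     # torch.Size([2, 32, 256])
```

The FiLM-style scale/shift is one plausible way to make positional embeddings depend on the instance; the actual mechanism used by Mamba-3VL may differ and is described only in the full paper.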

Subject: ICCV.2025 - Poster