#1 Towards Long-Horizon Vision-Language-Action System: Reasoning, Acting and Memory

Authors: Daixun Li, Yusi Zhang, Mingxiang Cao, Donglai Liu, Weiying Xie, Tianlin Hui, Lunkai Lin, Zhiqiang Xie, Yunsong Li

Vision-Language-Action (VLA) models are crucial for autonomous decision-making in embodied systems. While current methods have advanced single-skill abilities, their short-horizon capability limits applicability in real-world scenarios. To address this challenge, we propose MindExplore, a general hierarchical VLA system with cross-skill capability for long-horizon tasks in highly dynamic sandy environments. The key insight is to iteratively align the knowledge domains of task planning and action execution; this task-oriented acting enables strong generalization across a wide range of real-world scenarios. In the reasoning layer, task-specific chains of thought (CoT) are designed to plan long-horizon task sequences and provide meta-action signals. In the acting layer, a simple but powerful Mixture of Policy Experts strategy, guided by these signals and by multimodal inputs, adaptively selects skill experts and generates closed-loop action sequences. The acting layer also integrates a lightweight Multimodal Diffusion Policy (MMDP) that enhances spatial perception by fusing features from multiple visual modalities. In addition, a pioneering memory mechanism establishes feedback between the reasoning and acting layers, enabling adaptive execution of long-horizon tasks and real-time replanning. Notably, we create SandGo-1k and SandThink-21k, the first expert-level multimodal embodied dataset and CoT dataset tailored to sandy environments. Running at a high execution frequency of 30 FPS, MindExplore achieves a success rate 3.01 times that of existing methods in unstructured and dynamic environments.
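
The abstract describes a closed loop between a reasoning layer (CoT planning that emits meta-action signals), an acting layer (a Mixture of Policy Experts routed over skill policies), and a memory mechanism that feeds execution outcomes back for replanning. The sketch below is a minimal, hypothetical Python illustration of that control flow only; the class and function names (MemoryBuffer, reasoning_layer, acting_layer, run_long_horizon_task) and the toy plan are assumptions for illustration and do not correspond to the paper's actual implementation or APIs.

```python
# Hypothetical sketch of a reasoning-acting-memory loop for long-horizon tasks.
# None of these names come from the paper; they only illustrate the control flow
# described in the abstract (plan -> execute skills -> record feedback -> replan).

from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class MemoryBuffer:
    """Feedback channel between the acting and reasoning layers."""
    events: List[Dict] = field(default_factory=list)

    def record(self, step: int, meta_action: str, success: bool) -> None:
        self.events.append({"step": step, "meta_action": meta_action, "success": success})

    def needs_replan(self) -> bool:
        # Trigger replanning when the most recent skill execution failed.
        return bool(self.events) and not self.events[-1]["success"]


def reasoning_layer(task: str, memory: MemoryBuffer) -> List[str]:
    """Stand-in for the CoT planner: maps a long-horizon task to meta-action signals."""
    # A real planner would condition on the task description and memory feedback;
    # here we return a fixed toy plan.
    return ["navigate_to_target", "grasp_object", "transport", "place_object"]


def acting_layer(meta_action: str, observation: Dict) -> bool:
    """Stand-in for a Mixture of Policy Experts: routes a meta-action to a skill expert."""
    skill_experts: Dict[str, Callable[[Dict], bool]] = {
        "navigate_to_target": lambda obs: True,
        "grasp_object": lambda obs: obs.get("object_visible", True),
        "transport": lambda obs: True,
        "place_object": lambda obs: True,
    }
    expert = skill_experts.get(meta_action, lambda obs: False)
    return expert(observation)  # a real expert would run closed-loop at control rate


def run_long_horizon_task(task: str, max_replans: int = 3) -> None:
    memory = MemoryBuffer()
    for attempt in range(max_replans):
        plan = reasoning_layer(task, memory)
        for step, meta_action in enumerate(plan):
            observation = {"object_visible": True}  # placeholder multimodal input
            success = acting_layer(meta_action, observation)
            memory.record(step, meta_action, success)
            if memory.needs_replan():
                break  # feed the failure back to the reasoning layer and replan
        else:
            print(f"Task '{task}' completed after {attempt + 1} planning round(s).")
            return
    print(f"Task '{task}' not completed within {max_replans} planning rounds.")


if __name__ == "__main__":
    run_long_horizon_task("collect a sample in sandy terrain")
```

The point of the sketch is the feedback path: the acting layer never replans on its own; it only reports outcomes to memory, and the reasoning layer decides whether to issue a new meta-action sequence, mirroring the reasoning/acting/memory split the abstract describes.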

Subject: ICCV.2025 - Poster