Vision-Language-Action (VLA) models are crucial for autonomous decision-making in embodied systems. While current methods have advanced single-skill abilities, their short-horizon capability limits their applicability in real-world scenarios. To address this challenge, we propose MindExplore, a general hierarchical VLA system with cross-skill capability for long-horizon tasks in highly dynamic sandy environments. The key insight is to iteratively align the knowledge domains of task planning and action execution. This task-oriented alignment enables strong generalization across a wide range of real-world scenarios. In the reasoning layer, task-specific chains of thought (CoT) are designed to plan long-horizon task sequences and provide meta-action signals. In the acting layer, a simple yet effective Mixture of Policy Experts strategy uses these signals together with multimodal inputs to adaptively select skill experts and generate closed-loop action sequences. The acting layer also integrates a lightweight Multimodal Diffusion Policy (MMDP) that enhances spatial perception by fusing features from multiple visual modalities. In addition, a memory mechanism establishes feedback between the reasoning and acting layers, facilitating adaptive execution of long-horizon tasks and real-time replanning. Notably, we create SandGo-1k and SandThink-21k, the first expert-level multimodal embodied dataset and CoT dataset tailored for sandy environments. Running at a high execution frequency of 30 FPS, MindExplore achieves a success rate 3.01 times that of existing methods in unstructured and dynamic environments.
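To make the expert-selection idea concrete, the following is a minimal sketch of how a meta-action signal and observation features could jointly gate a set of skill experts. It is an illustration under stated assumptions, not the MindExplore implementation: the linear gate, the per-signal bias table, the skill names (`dig`, `walk`), and the function `select_expert` are all hypothetical and do not come from the paper.

```python
import math
from typing import Dict, List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_expert(
    meta_action: str,
    obs: Sequence[float],
    gate_weights: Dict[str, Sequence[float]],
    signal_bias: Dict[str, Dict[str, float]],
) -> Tuple[str, Dict[str, float]]:
    """Pick one skill expert from a gate score per expert.

    Assumption (hypothetical, for illustration only): each expert's score is
    a linear function of the observation features plus a bias that depends on
    the meta-action signal issued by the reasoning layer.
    """
    names = sorted(gate_weights)
    scores = []
    for name in names:
        # Observation-dependent part of the gate score.
        score = sum(w * x for w, x in zip(gate_weights[name], obs))
        # Signal-dependent part: zero if the signal does not mention the expert.
        score += signal_bias.get(meta_action, {}).get(name, 0.0)
        scores.append(score)
    probs = softmax(scores)
    best = max(range(len(names)), key=lambda i: probs[i])
    return names[best], dict(zip(names, probs))


# Hypothetical gate parameters for two illustrative skills.
GATE_WEIGHTS = {"dig": [1.0, 0.0], "walk": [0.0, 1.0]}
SIGNAL_BIAS = {"excavate": {"dig": 2.0}}

if __name__ == "__main__":
    obs = [0.5, 1.0]
    print(select_expert("move", obs, GATE_WEIGHTS, SIGNAL_BIAS)[0])
    print(select_expert("excavate", obs, GATE_WEIGHTS, SIGNAL_BIAS)[0])
```

In this sketch the same observation is routed to different experts depending on the meta-action signal, which is the behavior the abstract attributes to the acting layer. A learned gate would replace the hand-set weights and biases used here.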