PK07eretkF@OpenReview

Total: 1

#1 DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge

Authors: Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, XinQiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, He Wang, Zhizheng Zhang, Li Yi, Wenjun Zeng, Xin Jin

Recent advances in vision-language-action (VLA) models have shown promise in integrating image generation with action prediction to improve generalization and reasoning in robot manipulation. However, existing methods are limited to image-based forecasting, which suffers from redundant information and lacks comprehensive and critical world knowledge, including geometric, semantic, and spatial information. To address these limitations, we propose DreamVLA, a novel VLA framework that integrates comprehensive world knowledge forecasting to enable inverse dynamics modeling, thereby establishing an action-forecasting loop for manipulation tasks. Specifically, DreamVLA introduces a dynamic-region-guided world knowledge prediction mechanism, which anticipates visual, depth, geometric, semantic, and segmentation cues to provide compact yet comprehensive representations for action planning. This design aligns with how humans interact with the world, first forming abstract multimodal reasoning chains before acting. Moreover, to model the conditional distribution over future actions, we employ a diffusion-based transformer that disentangles action representations from shared latent features and better captures multimodal uncertainty. Extensive experiments in both real-world and simulated environments demonstrate that DreamVLA achieves a 76.7% success rate on real robot tasks and an average task length of 4.45 on the CALVIN ABC-D benchmark.
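As a rough illustration of the two components the abstract names, the sketch below shows (1) a set of learned query tokens that attend to observation features to produce compact world-knowledge cues and (2) a small diffusion-style transformer head that denoises an action chunk while cross-attending to the shared latent features. This is not the authors' released code; all module names, shapes, and hyperparameters (e.g., 16 queries, an 8-step action chunk, 7-DoF actions) are assumptions for the sketch.

```python
# Hedged sketch only: assumed shapes and module names, not DreamVLA's implementation.
import torch
import torch.nn as nn


class WorldKnowledgeQueries(nn.Module):
    """Learned queries that pull compact world-knowledge cues from observation tokens."""

    def __init__(self, dim=512, num_queries=16, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, obs_tokens):                      # obs_tokens: (B, N, dim)
        q = self.queries.unsqueeze(0).expand(obs_tokens.size(0), -1, -1)
        cues, _ = self.attn(q, obs_tokens, obs_tokens)  # (B, num_queries, dim)
        return self.proj(cues)


class DiffusionActionHead(nn.Module):
    """Transformer that predicts the noise added to an action chunk, conditioned on latents."""

    def __init__(self, action_dim=7, horizon=8, dim=512, depth=4, num_heads=8):
        super().__init__()
        self.action_in = nn.Linear(action_dim, dim)
        self.time_embed = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        layer = nn.TransformerDecoderLayer(dim, num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=depth)
        self.noise_out = nn.Linear(dim, action_dim)
        self.horizon = horizon

    def forward(self, noisy_actions, t, cond_tokens):
        # noisy_actions: (B, horizon, action_dim); t: (B,); cond_tokens: (B, M, dim)
        x = self.action_in(noisy_actions) + self.time_embed(t.float().view(-1, 1, 1) / 1000.0)
        x = self.decoder(x, cond_tokens)                # cross-attend to shared latent features
        return self.noise_out(x)                        # predicted noise, same shape as the actions


if __name__ == "__main__":
    B, N, dim = 2, 196, 512
    obs = torch.randn(B, N, dim)                        # stand-in for fused image/depth/semantic features
    cues = WorldKnowledgeQueries(dim)(obs)              # compact world-knowledge tokens
    head = DiffusionActionHead(dim=dim)
    noisy = torch.randn(B, 8, 7)                        # noised 8-step chunk of 7-DoF actions
    t = torch.randint(0, 1000, (B,))
    eps_hat = head(noisy, t, torch.cat([obs, cues], dim=1))
    print(eps_hat.shape)                                # torch.Size([2, 8, 7])
```

In this reading, the query tokens stand in for the "compact yet comprehensive representations" the abstract describes, and the diffusion head consumes them alongside the raw observation latents rather than regenerating full future images.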

Subject: NeurIPS.2025 - Poster