dVLM-AD: Enhance Diffusion Vision-Language-Model for Driving via Controllable Reasoning

Authors: Yingzi Ma, Yulong Cao, Wenhao Ding, Shuibai Zhang, Yan Wang, Boris Ivanovic, Ming Jiang, Marco Pavone, Chaowei Xiao

The autonomous driving community is increasingly focused on addressing the challenges posed by out-of-distribution (OOD) driving scenarios. A dominant research trend seeks to enhance end-to-end (E2E) driving systems by integrating vision-language models (VLMs), leveraging their rich world knowledge and reasoning abilities to improve generalization across diverse environments. However, most existing VLMs or vision-language agents (VLAs) are built upon autoregressive (AR) models. In this paper, we observe that existing AR-based VLMs -- limited by causal attention and sequential token generation -- often fail to maintain consistency and controllability between high-level reasoning and low-level planning. In contrast, recent discrete diffusion VLMs equipped with bidirectional attention exhibit superior controllability and reliability through iterative denoising. Building on these observations, we introduce dVLM-AD, a diffusion-based vision-language model that unifies perception, structured reasoning, and low-level planning for end-to-end driving. Evaluated on nuScenes and WOD-E2E, dVLM-AD yields more consistent reasoning-action pairs and achieves planning performance comparable to existing driving VLM/VLA systems despite a modest backbone, outperforming AR-based baselines with a 9 percent improvement in behavior-trajectory consistency and a 6 percent increase in RFS on long-tail WOD-E2E scenarios. These results suggest a controllable and reliable pathway for scalable end-to-end driving.

Subject: Computer Vision and Pattern Recognition

Publish: 2025-12-04 05:05:41 UTC

arXiv: 2512.04459
