10 Open Challenges Steering the Future of Vision-Language-Action Models

Authors: Soujanya Poria, Navonil Majumder, Chia-Yu Hung, Amir Ali Bagherzadeh, Chuan Li, Kenneth Kwok, Ziwei Wang, Cheston Tan, Jiajun Wu, David Hsu

Due to their ability to follow natural language instructions, vision-language-action (VLA) models are increasingly prevalent in the embodied AI arena, following the widespread success of their precursors -- LLMs and VLMs. In this paper, we discuss 10 principal milestones in the ongoing development of VLA models -- multimodality, reasoning, data, evaluation, cross-robot action generalization, efficiency, whole-body coordination, safety, agents, and coordination with humans. Furthermore, we discuss the emerging trends of using spatial understanding, modeling world dynamics, post-training, and data synthesis -- all aiming to reach these milestones. Through these discussions, we hope to bring attention to the research avenues that may accelerate the development of VLA models toward wider acceptance.

Subjects: Robotics, Artificial Intelligence

Publish: 2025-11-08 09:02:13 UTC

2511.05936
