Benchmarking Vision, Language, & Action Models in Procedurally Generated, Open Ended Action Environments

Authors: Pranav Guruprasad, Yangyue Wang, Sudipta Chowdhury, Harshvardhan Sikka

Vision-language-action (VLA) models represent an important step toward general-purpose robotic systems by integrating visual perception, language understanding, and action execution. However, systematic evaluation of these models, particularly their zero-shot generalization capabilities in out-of-distribution (OOD) environments, remains limited. In this paper, we introduce MultiNet v0.2, a comprehensive benchmark designed to evaluate and analyze the generalization performance of state-of-the-art VLM and VLA models (including GPT-4o, GPT-4.1, OpenVLA, Pi0 Base, and Pi0 FAST) on diverse procedural tasks from the Procgen benchmark. Our analysis reveals several critical insights: (1) all evaluated models exhibit significant limitations in zero-shot generalization to OOD tasks, with performance heavily influenced by factors such as action representation and task complexity; (2) VLAs generally outperform other models due to their robust architectural design; and (3) VLM variants demonstrate substantial improvements when constrained appropriately, highlighting the sensitivity of model performance to precise prompt engineering.

Subjects: Computer Vision and Pattern Recognition, Machine Learning

Publish: 2025-05-08 16:51:36 UTC

2505.05540
