MoS-VLA: A Vision-Language-Action Model with One-Shot Skill Adaptation

Authors: Ruihan Zhao, Tyler Ingebrand, Sandeep Chinchali, Ufuk Topcu

Vision-Language-Action (VLA) models trained on large robot datasets promise general-purpose, robust control across diverse domains and embodiments. However, existing approaches often fail out of the box when deployed in novel environments, embodiments, or tasks. We introduce Mixture of Skills VLA (MoS-VLA), a framework that represents robot manipulation policies as linear combinations of a finite set of learned basis functions. During pretraining, MoS-VLA jointly learns these basis functions across datasets from the Open X-Embodiment project, producing a structured skill space. At test time, adapting to a new task requires only a single expert demonstration: the corresponding skill representation is inferred by solving a lightweight convex optimization problem that minimizes the L1 action error, with no gradient updates to the model. This gradient-free adaptation incurs minimal overhead while enabling rapid instantiation of new skills. Empirically, MoS-VLA achieves lower action-prediction error on all five unseen datasets evaluated and succeeds in both simulation and real-robot tasks where a pretrained VLA model fails outright. Project page: mos-vla.github.io/
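
The one-shot adaptation step described in the abstract can be illustrated with a small sketch. Everything here is an assumption made for illustration, not the authors' implementation: the basis-function outputs for the demonstration are taken to be stacked into a matrix `phi` with one column per basis function, the function names `fit_skill_coefficients` and `predict_action` are hypothetical, and iteratively reweighted least squares (IRLS) is swapped in as a simple NumPy-only solver for the convex L1 objective. The abstract does not say which solver is used.

```python
import numpy as np


def fit_skill_coefficients(phi, actions, n_iters=50, eps=1e-8):
    """Approximately solve min_c ||phi @ c - actions||_1 with IRLS.

    phi:     (N, K) array. Column k holds the k-th basis function's output,
             flattened over demo timesteps and action dimensions
             (an assumed layout, not taken from the paper).
    actions: (N,) array of expert actions, flattened the same way.
    """
    phi = np.asarray(phi, dtype=float)
    actions = np.asarray(actions, dtype=float)
    weights = np.ones(phi.shape[0])
    coeffs = np.zeros(phi.shape[1])
    for _ in range(n_iters):
        sqrt_w = np.sqrt(weights)
        # Weighted least squares: minimize sum_i w_i * (phi_i @ c - a_i)^2.
        coeffs, *_ = np.linalg.lstsq(
            phi * sqrt_w[:, None], actions * sqrt_w, rcond=None
        )
        residual = phi @ coeffs - actions
        # With w_i = 1 / |r_i|, the weighted squared error matches the L1
        # error at the current iterate; eps guards against division by zero.
        weights = 1.0 / np.maximum(np.abs(residual), eps)
    return coeffs


def predict_action(basis_outputs, coeffs):
    """Combine per-basis action predictions of shape (K, action_dim)."""
    return coeffs @ np.asarray(basis_outputs, dtype=float)


# Toy demonstration: synthetic basis outputs and a known coefficient vector.
rng = np.random.default_rng(0)
true_coeffs = np.array([0.5, -1.0, 2.0])
phi = rng.normal(size=(40, 3))
actions = phi @ true_coeffs  # noiseless "expert" actions
coeffs = fit_skill_coefficients(phi, actions)
```

Only the coefficient vector is fitted; the basis functions stay frozen, which is what makes the adaptation gradient-free with respect to the model. At deployment, `predict_action` would combine the frozen basis outputs for a new observation using the fitted coefficients.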

Subject: Robotics

Published: 2025-10-18 19:16:08 UTC

arXiv: 2510.16617
