Rethinking Intermediate Representation for VLM-based Robot Manipulation

Authors: Weiliang Tang, Jialin Gao, Jia-Hui Pan, Gang Wang, Li Erran Li, Yunhui Liu, Mingyu Ding, Pheng-Ann Heng, Chi-Wing Fu

Vision-Language Models (VLMs) are an important component for enabling robust robot manipulation. Yet using a VLM to translate human instructions into an action-resolvable intermediate representation often requires a tradeoff between VLM-comprehensibility and generalizability. Inspired by context-free grammars, we design a Semantic Assembly representation, named SEAM, by decomposing the intermediate representation into a vocabulary and a grammar. Doing so yields a concise vocabulary of semantically rich operations and a VLM-friendly grammar for handling diverse unseen tasks. In addition, we design a new open-vocabulary segmentation paradigm with a retrieval-augmented few-shot learning strategy to localize fine-grained object parts for manipulation, achieving the shortest inference time among all state-of-the-art concurrent works. We also formulate new metrics for action-generalizability and VLM-comprehensibility, which show the compelling performance of SEAM over mainstream representations on both aspects. Extensive real-world experiments further demonstrate its state-of-the-art performance under varying settings and tasks.

Subject: Robotics

Publish: 2025-11-24 17:09:50 UTC

arXiv: 2511.19315
