FASTer: Toward Efficient Autoregressive Vision Language Action Modeling via Neural Action Tokenization

Authors: Yicheng Liu, Shiduo Zhang, Zibin Dong, Baijun Ye, Tianyuan Yuan, Xiaopeng Yu, Linqi Yin, Chenhao Lu, Junhao Shi, Luca Jiang-Tao Yu, Liangtao Zheng, Tao Jiang, Jingjing Gong, Xipeng Qiu, Hang Zhao

Autoregressive vision-language-action (VLA) models have recently demonstrated strong capabilities in robotic manipulation. However, their core process of action tokenization often involves a trade-off between reconstruction fidelity and inference efficiency. We introduce FASTer, a unified framework for efficient and generalizable robot learning that integrates a learnable tokenizer with an autoregressive policy built upon it. FASTerVQ encodes action chunks as single-channel images, capturing global spatio-temporal dependencies while maintaining a high compression ratio. FASTerVLA builds on this tokenizer with block-wise autoregressive decoding and a lightweight action expert, achieving both faster inference and higher task performance. Extensive experiments across simulated and real-world benchmarks show that FASTerVQ delivers superior reconstruction quality, high token utilization, and strong cross-task and cross-embodiment generalization, while FASTerVLA further improves overall capability, surpassing previous state-of-the-art VLA models in both inference speed and task performance.
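The idea of encoding an action chunk as a single-channel image and compressing it into a few discrete tokens can be illustrated with a toy sketch. This is not the paper's tokenizer: FASTerVQ is a learned neural model, whereas the code below uses a fixed random codebook and plain nearest-neighbour vector quantization over non-overlapping patches. The function names (`chunk_to_image`, `patchify`, `quantize`, `decode`), the chunk shape, the patch size, and the codebook size are all assumptions chosen for illustration.

```python
import random

def chunk_to_image(chunk):
    """Wrap a T x D action chunk (T timesteps, D action dims) as a 1-channel image."""
    return [chunk]

def patchify(image, ph, pw):
    """Split the single channel into non-overlapping ph x pw patches, each flattened row-major."""
    channel = image[0]
    T, D = len(channel), len(channel[0])
    assert T % ph == 0 and D % pw == 0, "patch size must divide the chunk shape"
    return [
        [channel[t][d] for t in range(t0, t0 + ph) for d in range(d0, d0 + pw)]
        for t0 in range(0, T, ph)
        for d0 in range(0, D, pw)
    ]

def quantize(patches, codebook):
    """Map each patch to the index of its nearest codebook entry (squared L2 distance)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: sqdist(p, codebook[k])) for p in patches]

def decode(tokens, codebook, T, D, ph, pw):
    """Rebuild a T x D chunk by pasting the selected codebook patches back in place."""
    out = [[0.0] * D for _ in range(T)]
    i = 0
    for t0 in range(0, T, ph):
        for d0 in range(0, D, pw):
            flat = codebook[tokens[i]]
            for r in range(ph):
                for c in range(pw):
                    out[t0 + r][d0 + c] = flat[r * pw + c]
            i += 1
    return out

# Hypothetical sizes: 8 timesteps, 4 action dims, 4 x 2 patches, 16 codebook entries.
random.seed(0)
T, D, PH, PW = 8, 4, 4, 2
chunk = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(T)]
codebook = [[random.uniform(-1, 1) for _ in range(PH * PW)] for _ in range(16)]

tokens = quantize(patchify(chunk_to_image(chunk), PH, PW), codebook)
recon = decode(tokens, codebook, T, D, PH, PW)
print(len(tokens), "tokens for", T * D, "action values")
```

With these assumed sizes, each token covers a patch that spans several timesteps and several action dimensions at once, which is the sense in which an image-style tokenizer can capture spatio-temporal structure while keeping the token count low. A random codebook reconstructs poorly; the trade-off between reconstruction fidelity and token count that the abstract describes is what a learned tokenizer is meant to improve.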

Subjects: Computer Vision and Pattern Recognition, Robotics

Published: 2025-12-04 16:21:38 UTC

2512.04952
