RoboMirror: Understand Before You Imitate for Video to Humanoid Locomotion

Authors: Zhe Li, Cheng Chi, Yangyang Wei, Boan Zhu, Tao Huang, Zhenguo Sun, Yibo Peng, Pengwei Wang, Zhongyuan Wang, Fangzhou Liu, Chang Xu, Shanghang Zhang

Humans learn locomotion through visual observation, interpreting visual content before imitating actions. However, state-of-the-art humanoid locomotion systems rely on either curated motion capture trajectories or sparse text commands, leaving a critical gap between visual understanding and control. Text-to-motion methods suffer from semantic sparsity and staged pipeline errors, while video-based approaches perform only mechanical pose mimicry without genuine visual understanding. We propose RoboMirror, the first retargeting-free video-to-locomotion framework embodying "understand before you imitate". Leveraging VLMs, it distills raw egocentric/third-person videos into visual motion intents, which directly condition a diffusion-based policy to generate physically plausible, semantically aligned locomotion without explicit pose reconstruction or retargeting. Extensive experiments validate the effectiveness of RoboMirror: it enables telepresence via egocentric videos, reduces third-person control latency by 80%, and achieves a 3.7% higher task success rate than baselines. By reframing humanoid control around video understanding, we bridge the gap between visual understanding and action.

Subjects: Robotics, Computer Vision and Pattern Recognition

Publish: 2025-12-29 17:59:19 UTC

2512.23649
