ExtPose: Robust and Coherent Pose Estimation by Extending ViTs


Authors: Glory Rongyu CHEN, Li'an Zhuo, Linlin Yang, Qi WANG, Liefeng Bo, Bang Zhang, Angela Yao

Vision Transformers (ViTs) perform remarkably well at 3D pose estimation, yet challenges remain. One is that the popular ViT architecture for pose estimation is limited to images and lacks temporal information. Another is that predictions often fail to stay pixel-aligned with the original images. To address these issues, we propose ExtPose, a systematic framework for 3D pose estimation. ExtPose extends an image ViT to challenging scenarios and to the video setting by taking in additional 2D pose evidence and capturing temporal information in a fully attention-based manner. We use 2D human skeleton images to integrate structured 2D pose information. By sharing parameters and attending across modalities and frames, we enhance the consistency between 3D poses and 2D videos without introducing additional parameters. We achieve state-of-the-art (SOTA) performance on multiple human and hand pose estimation benchmarks, with substantial PA-MPJPE improvements over other ViT-based methods: 34.0 mm (-23%) on 3DPW and 4.9 mm (-18%) on FreiHAND.
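
The idea of "sharing parameters and attending across modalities and frames" can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes single-head attention, random toy tokens, and hypothetical helper names (`joint_sequence`, `full_attention`), and it omits patch embedding, positional encodings, and everything else a real ViT needs. It only shows that one set of query/key/value weights can serve image tokens and skeleton-image tokens from every frame once they are placed in a single sequence.

```python
import math
import random


def linear(tokens, weight):
    """Apply one weight matrix (d_in x d_out) to every token."""
    return [[sum(t[i] * weight[i][j] for i in range(len(t)))
             for j in range(len(weight[0]))] for t in tokens]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def full_attention(tokens, w_q, w_k, w_v):
    """Single-head self-attention: every token attends to every token."""
    q, k, v = linear(tokens, w_q), linear(tokens, w_k), linear(tokens, w_v)
    scale = math.sqrt(len(q[0]))
    outputs, weights = [], []
    for qi in q:
        row = softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in k])
        weights.append(row)
        outputs.append([sum(w * vj[d] for w, vj in zip(row, v))
                        for d in range(len(v[0]))])
    return outputs, weights


def joint_sequence(image_tokens, skeleton_tokens):
    """Flatten [frame][token][dim] inputs from both modalities into one sequence."""
    seq = []
    for frame_img, frame_skel in zip(image_tokens, skeleton_tokens):
        seq.extend(frame_img)   # tokens from the RGB frame
        seq.extend(frame_skel)  # tokens from the 2D skeleton image of that frame
    return seq


rng = random.Random(0)
DIM, FRAMES, TOKENS_PER_FRAME = 4, 3, 2  # toy sizes, chosen for illustration


def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


# One set of weights, reused for both modalities and all frames (assumption:
# this is how "no additional parameters" is read here).
w_q, w_k, w_v = (rand_matrix(DIM, DIM) for _ in range(3))

image_tokens = [rand_matrix(TOKENS_PER_FRAME, DIM) for _ in range(FRAMES)]
skeleton_tokens = [rand_matrix(TOKENS_PER_FRAME, DIM) for _ in range(FRAMES)]

sequence = joint_sequence(image_tokens, skeleton_tokens)
outputs, attn = full_attention(sequence, w_q, w_k, w_v)
```

Each row of `attn` spans all frames and both modalities, so a token from one frame's image can draw on skeleton tokens from any other frame, while the parameter count stays that of a single attention layer.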

Subject: ICML.2025 - Poster

hm9FNEZZ6z@OpenReview
