Pairwise pose estimation from images with little or no overlap is an open challenge in computer vision. Existing methods, even those trained on large-scale datasets, struggle in these scenarios due to the lack of identifiable correspondences or visual overlap. Inspired by the human ability to infer spatial relationships from diverse scenes, we propose a novel approach that leverages the rich priors encoded in pre-trained generative video models. We use a video model to hallucinate intermediate frames between two input images, effectively creating a dense visual transition that significantly simplifies the problem of pose estimation. Since current video models can still produce implausible motion or inconsistent geometry, we introduce a self-consistency score that evaluates the agreement of pose predictions across sampled videos. We demonstrate that our approach generalizes across three state-of-the-art video models and show consistent improvements over the state-of-the-art DUSt3R baseline on four diverse datasets encompassing indoor, outdoor, and object-centric scenes. Our findings suggest a promising avenue for improving pose estimation models by leveraging large generative models trained on vast amounts of video data, which is more readily available than 3D data.
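To make the pipeline concrete, the sketch below illustrates the overall idea under stated assumptions: sample several hallucinated videos between the two input images, chain two-view pose estimates over adjacent frames, and keep the candidate pose that agrees most with the other samples. The callables `sample_video` and `estimate_pairwise_pose` are hypothetical stand-ins for a generative video model and a two-view pose estimator (e.g., a DUSt3R-style model); the consistency score shown (mean geodesic rotation distance to the other candidates) is one plausible instantiation, not necessarily the paper's exact formulation.

```python
import numpy as np

def rotation_geodesic(Ra, Rb):
    """Geodesic angle (radians) between two 3x3 rotation matrices."""
    cos = (np.trace(Ra.T @ Rb) - 1.0) / 2.0
    return np.arccos(np.clip(cos, -1.0, 1.0))

def compose_relative_pose(frames, estimate_pairwise_pose):
    """Chain per-step relative poses along one hallucinated video.

    estimate_pairwise_pose(f_i, f_j) is a hypothetical two-view estimator
    returning a 4x4 relative transform; chaining adjacent frames yields the
    end-to-end relative pose between the two input images.
    """
    T = np.eye(4)
    for f_i, f_j in zip(frames[:-1], frames[1:]):
        T = estimate_pairwise_pose(f_i, f_j) @ T
    return T

def select_by_self_consistency(candidates):
    """Rank candidates by mean geodesic rotation distance to the others
    (an assumed consistency score) and return the most consistent one."""
    scores = []
    for i, Ti in enumerate(candidates):
        dists = [rotation_geodesic(Ti[:3, :3], Tj[:3, :3])
                 for j, Tj in enumerate(candidates) if j != i]
        scores.append(np.mean(dists))
    best = int(np.argmin(scores))
    return candidates[best], scores[best]

def estimate_pose(img_a, img_b, sample_video, estimate_pairwise_pose, k=8):
    """Full pipeline: sample k videos, chain poses, keep the most consistent."""
    candidates = []
    for _ in range(k):
        frames = sample_video(img_a, img_b)  # hallucinated in-between frames
        candidates.append(compose_relative_pose(frames, estimate_pairwise_pose))
    return select_by_self_consistency(candidates)
```

The key design choice this illustrates is that stochastic sampling turns the video model's unreliability into a signal: poses recovered from geometrically plausible videos cluster together, while implausible samples land far from the consensus and are filtered out by the consistency score.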