Total: 1

#1 MV-SSM: Multi-View State Space Modeling for 3D Human Pose Estimation

Authors: Aviral Chharia, Wenbo Gou, Haoye Dong

Though single-view 3D human pose estimation has received considerable attention, multi-view multi-person 3D pose estimation faces several challenges, including occlusions and generalization to new camera arrangements or scenarios. Existing transformer-based approaches often struggle to accurately model joint spatial sequences, especially in occluded scenes. To address this, we present MV-SSM, a novel Multi-View State Space Modeling framework that robustly reconstructs 3D human poses by explicitly modeling the joint spatial sequence at two distinct levels: the feature level from multi-view images and the joint level of the person. Specifically, we propose a Projective State Space (PSS) block to learn joint spatial sequences using state space modeling. Furthermore, we modify Mamba's unidirectional scanning into an effective Grid Token-guided Bidirectional Scan (GTBS), which is integral to the PSS block. Experiments on multiple challenging benchmarks demonstrate that MV-SSM achieves highly accurate 3D pose estimation and generalizes across the number of cameras (+10.8 AP25 in the challenging three-camera setting on CMU Panoptic), varying camera arrangements (+7.0 AP25), and datasets (+15.3 PCP on Campus A1), significantly outperforming state-of-the-art methods. The code has been submitted and will be open-sourced with model weights upon acceptance.
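To make the scanning idea concrete, the sketch below shows, in plain PyTorch, how a sequence of joint tokens can be processed by a bidirectional state space scan: one linear recurrence runs forward over the sequence, a second runs over the reversed sequence, and the two outputs are fused. This is only an illustration of the general mechanism; all class names, dimensions, and the simple fixed-parameter recurrence are assumptions, and it does not reproduce the paper's selective (input-dependent) Mamba parameters or the grid-token-guided ordering of GTBS.

```python
import torch
import torch.nn as nn


class SimpleSSMScan(nn.Module):
    """Minimal linear state space scan: h_t = A h_{t-1} + B x_t, y_t = C h_t.

    Illustrative only -- real Mamba layers use input-dependent (selective)
    parameters and a hardware-aware parallel scan instead of this loop.
    """

    def __init__(self, dim: int, state_dim: int = 16):
        super().__init__()
        self.A = nn.Parameter(torch.randn(state_dim, state_dim) * 0.01)
        self.B = nn.Linear(dim, state_dim, bias=False)
        self.C = nn.Linear(state_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim), e.g. a sequence of per-joint tokens.
        h = x.new_zeros(x.size(0), self.A.size(0))
        ys = []
        for t in range(x.size(1)):
            h = h @ self.A.T + self.B(x[:, t])  # recurrent state update
            ys.append(self.C(h))                # per-step readout
        return torch.stack(ys, dim=1)


class BidirectionalScan(nn.Module):
    """Toy stand-in for a bidirectional scan over joint tokens.

    One scan reads the sequence forward, a second reads it reversed,
    and a linear layer fuses both views at every position.
    """

    def __init__(self, dim: int, state_dim: int = 16):
        super().__init__()
        self.fwd = SimpleSSMScan(dim, state_dim)
        self.bwd = SimpleSSMScan(dim, state_dim)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y_f = self.fwd(x)
        y_b = self.bwd(torch.flip(x, dims=[1])).flip(dims=[1])
        return self.fuse(torch.cat([y_f, y_b], dim=-1))


if __name__ == "__main__":
    tokens = torch.randn(2, 17, 64)  # e.g. 17 joint tokens per person, 64-dim
    out = BidirectionalScan(64)(tokens)
    print(out.shape)  # torch.Size([2, 17, 64])
```

Running a second scan over the reversed sequence is what lets each joint token aggregate context from both earlier and later positions, which is the property a purely unidirectional scan lacks and, per the abstract, what motivates replacing Mamba's one-way scan in the PSS block.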

Subject: CVPR.2025 - Poster