PoseLLM: Enhancing Language-Guided Human Pose Estimation with MLP Alignment

Authors: Dewen Zhang, Tahir Hussain, Wangpeng An, Hayaru Shouno

Human pose estimation traditionally relies on architectures that encode keypoint priors, limiting their generalization to novel poses or unseen keypoints. Recent language-guided approaches like LocLLM reformulate keypoint localization as a vision-language task, enabling zero-shot generalization through textual descriptions. However, LocLLM's linear projector fails to capture complex spatial-textual interactions critical for high-precision localization. To address this, we propose PoseLLM, the first Large Language Model (LLM)-based pose estimation framework that replaces the linear projector with a nonlinear MLP vision-language connector. This lightweight two-layer MLP with GELU activation enables hierarchical cross-modal feature transformation, enhancing the fusion of visual patches and textual keypoint descriptions. Trained exclusively on COCO data, PoseLLM achieves 77.8 AP on the COCO validation set, outperforming LocLLM by +0.4 AP, while maintaining strong zero-shot generalization on Human-Art and MPII. Our work demonstrates that a simple yet powerful nonlinear connector significantly boosts localization accuracy without sacrificing generalization, advancing the state of the art in language-guided pose estimation. Code is available at https://github.com/Ody-trek/PoseLLM.
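
As a rough illustration of the connector described in the abstract, the following is a minimal, dependency-free sketch of a two-layer MLP with GELU activation that maps visual patch features into a language model's embedding space. It is not the authors' implementation: the function names, the toy dimensions, the initialization, and the choice of a hidden width equal to the output width are all assumptions made for illustration. A linear projector, by contrast, would correspond to a single call to `linear`.

```python
import math
import random


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(tokens, weight, bias):
    """Apply y = W x + b to each token; weight has shape (out_dim, in_dim)."""
    return [
        [sum(w * x for w, x in zip(row, tok)) + b for row, b in zip(weight, bias)]
        for tok in tokens
    ]


def init_layer(in_dim, out_dim, rng):
    """Small random weights and zero biases (illustrative initialization only)."""
    weight = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def mlp_connector(tokens, layer1, layer2):
    """Two-layer MLP connector: Linear -> GELU -> Linear."""
    hidden = linear(tokens, *layer1)
    hidden = [[gelu(h) for h in tok] for tok in hidden]
    return linear(hidden, *layer2)


# Hypothetical toy sizes; real vision and LLM widths are far larger.
VISION_DIM, LLM_DIM, NUM_PATCHES = 8, 16, 4

rng = random.Random(0)
patches = [[rng.gauss(0.0, 1.0) for _ in range(VISION_DIM)] for _ in range(NUM_PATCHES)]

# Assumption: the hidden width equals the LLM embedding width.
layer1 = init_layer(VISION_DIM, LLM_DIM, rng)
layer2 = init_layer(LLM_DIM, LLM_DIM, rng)

# One projected token per visual patch, each of width LLM_DIM.
projected = mlp_connector(patches, layer1, layer2)
```

The nonlinearity between the two linear maps is what separates this connector from a single linear projection: without it, the two layers would collapse into one equivalent linear map.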

Subject: Computer Vision and Pattern Recognition

Publish: 2025-07-12 04:53:39 UTC

arXiv: 2507.09139
