HeRO: Hierarchical 3D Semantic Representation for Pose-aware Object Manipulation

Authors: Chongyang Xu, Shen Cheng, Haipeng Li, Haoqiang Fan, Ziliang Feng, Shuaicheng Liu

Imitation learning for robotic manipulation has progressed from 2D image policies to 3D representations that explicitly encode geometry. Yet purely geometric policies often lack explicit part-level semantics, which are critical for pose-aware manipulation (e.g., distinguishing a shoe's toe from its heel). In this paper, we present HeRO, a diffusion-based policy that couples geometry and semantics via hierarchical semantic fields. HeRO employs dense semantics lifting to fuse discriminative, geometry-sensitive features from DINOv2 with the smooth, globally coherent correspondences from Stable Diffusion, yielding dense features that are both fine-grained and spatially consistent. These features are processed and partitioned to construct a global field and a set of local fields. A hierarchical conditioning module conditions the generative denoiser on the global and local fields using a permutation-invariant network architecture, thereby avoiding order-sensitive bias and producing a coherent control policy for pose-aware manipulation. In various tests, HeRO establishes a new state of the art, improving success on Place Dual Shoes by 12.3% and achieving an average gain of 6.5% across six challenging pose-aware tasks. Code is available at https://github.com/Chongyang-99/HeRO.
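
The abstract does not spell out how the hierarchical conditioning module achieves permutation invariance. As a rough illustration only, and not the authors' architecture, one common way to get this property is a PointNet-style encoder: a shared per-point map followed by a symmetric pooling operation. In the sketch below every name (`shared_map`, `encode_field`, `hierarchical_condition`), the choice of max and mean pooling, and the feature sizes are assumptions made for the example.

```python
import math
import random

def shared_map(x, weights, bias):
    """Apply the same linear layer + ReLU to one per-point feature vector."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]

def encode_field(points, weights, bias):
    """Encode an unordered set of point features.

    Max-pooling over points is symmetric, so point order does not matter.
    """
    encoded = [shared_map(p, weights, bias) for p in points]
    return [max(column) for column in zip(*encoded)]

def hierarchical_condition(global_field, local_fields, weights, bias):
    """Build a conditioning vector from one global field and a set of local fields.

    Local fields are averaged (a symmetric operation), so their order is
    irrelevant. math.fsum is exactly rounded, hence order-independent.
    """
    global_code = encode_field(global_field, weights, bias)
    local_codes = [encode_field(f, weights, bias) for f in local_fields]
    local_code = [math.fsum(column) / len(local_codes)
                  for column in zip(*local_codes)]
    return global_code + local_code  # concatenation: [global | pooled local]

# Hypothetical toy data: 3-dim point features, 4-dim encodings.
rng = random.Random(0)
weights = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
bias = [0.0] * 4
global_field = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(6)]
local_fields = [[[rng.uniform(-1, 1) for _ in range(3)] for _ in range(n)]
                for n in (2, 3, 4)]

cond = hierarchical_condition(global_field, local_fields, weights, bias)

# Reordering the local fields, and the points inside each one, leaves the
# conditioning vector unchanged.
shuffled = [list(reversed(f)) for f in reversed(local_fields)]
cond_shuffled = hierarchical_condition(global_field, shuffled, weights, bias)
print(all(math.isclose(a, b) for a, b in zip(cond, cond_shuffled)))
```

In a diffusion policy, a vector like `cond` would be passed to the denoiser as conditioning. Because both pooling steps are symmetric, the policy cannot pick up a bias from the arbitrary order in which the local fields happen to be listed.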

Subject: Computer Vision and Pattern Recognition

Publish: 2026-02-21 12:29:10 UTC

arXiv: 2602.18817
