HeRO: Hierarchical 3D Semantic Representation for Pose-aware Object Manipulation

📅 2026-02-21
🤖 AI Summary
This work addresses a limitation of existing 3D geometry-based robotic imitation learning methods: they lack explicit part-level semantic understanding and therefore struggle with pose-sensitive manipulation tasks, such as distinguishing a shoe's toe from its heel. To overcome this, the authors propose HeRO, a diffusion-based policy that couples geometry and semantics through hierarchical semantic fields. HeRO fuses features from DINOv2 and Stable Diffusion into dense features, builds a global field and a set of local fields from them, and conditions the denoiser on these fields with a permutation-invariant network, which avoids order-sensitive bias and enables fine-grained, spatially consistent part perception. HeRO improves the success rate on the Place Dual Shoes task by 12.3% and achieves an average gain of 6.5% across six pose-aware manipulation tasks, establishing a new state of the art.

📝 Abstract
Imitation learning for robotic manipulation has progressed from 2D image policies to 3D representations that explicitly encode geometry. Yet purely geometric policies often lack explicit part-level semantics, which are critical for pose-aware manipulation (e.g., distinguishing a shoe's toe from its heel). In this paper, we present HeRO, a diffusion-based policy that couples geometry and semantics via hierarchical semantic fields. HeRO employs dense semantics lifting to fuse discriminative, geometry-sensitive features from DINOv2 with the smooth, globally coherent correspondences from Stable Diffusion, yielding dense features that are both fine-grained and spatially consistent. These features are processed and partitioned to construct a global field and a set of local fields. A hierarchical conditioning module conditions the generative denoiser on the global and local fields using a permutation-invariant network architecture, thereby avoiding order-sensitive bias and producing a coherent control policy for pose-aware manipulation. In experiments, HeRO establishes a new state of the art, improving success on Place Dual Shoes by 12.3% and averaging 6.5% gains across six challenging pose-aware tasks. Code is available at https://github.com/Chongyang-99/HeRO.
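
To make the idea of permutation-invariant conditioning on global and local fields more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The function names, the weighted concatenation used for fusion, the max-plus-mean pooling, and the integer part labels are all assumptions chosen for illustration, and the per-point feature vectors stand in for features that would in practice come from DINOv2 and Stable Diffusion.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def l2_normalize(v: Sequence[float], eps: float = 1e-8) -> Vector:
    """Scale a feature vector to (approximately) unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + eps) for x in v]


def fuse_features(feat_a: Sequence[float], feat_b: Sequence[float],
                  weight: float = 0.5) -> Vector:
    """Hypothetical fusion: normalize each source, weight it, and concatenate."""
    a = [weight * x for x in l2_normalize(feat_a)]
    b = [(1.0 - weight) * x for x in l2_normalize(feat_b)]
    return a + b


def pool_field(point_feats: Sequence[Vector]) -> Vector:
    """Summarize a set of per-point features with symmetric operations.

    Max and mean ignore the order of their inputs, so the result does not
    depend on how the points are ordered.
    """
    dims = list(zip(*point_feats))  # transpose: one tuple per feature dimension
    max_pool = [max(d) for d in dims]
    mean_pool = [sum(d) / len(d) for d in dims]
    return max_pool + mean_pool


def hierarchical_condition(fused: Sequence[Vector],
                           part_ids: Sequence[int]) -> Vector:
    """Build a conditioning vector from one global field and several local fields."""
    # Global field: pool over every point.
    global_token = pool_field(fused)

    # Local fields: group points by their (assumed) part label, then pool each group.
    parts: Dict[int, List[Vector]] = {}
    for feat, pid in zip(fused, part_ids):
        parts.setdefault(pid, []).append(feat)
    local_tokens = [pool_field(group) for group in parts.values()]

    # Average the local tokens so the order of the local fields does not matter.
    local_summary = [sum(d) / len(d) for d in zip(*local_tokens)]
    return global_token + local_summary
```

Because every aggregation step is a symmetric function (max or mean), shuffling the points or reordering the local fields leaves the conditioning vector unchanged up to floating-point rounding. That is the property the abstract refers to as avoiding order-sensitive bias. A learned model would typically replace the fixed pooling with a trainable permutation-invariant network.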
Problem

Research questions and friction points this paper is trying to address.

pose-aware manipulation
semantic representation
imitation learning
3D geometry
part-level semantics
Innovation

Methods, ideas, or system contributions that make the work stand out.

hierarchical semantic fields
dense semantics lifting
diffusion-based policy
pose-aware manipulation
permutation-invariant conditioning