HeRO: Hierarchical 3D Semantic Representation for Pose-aware Object Manipulation

📅 2026-02-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing 3D geometry-based robotic imitation learning methods: they lack explicit part-level semantic understanding and therefore struggle with pose-sensitive manipulation tasks, such as distinguishing a shoe's toe from its heel. To overcome this, the authors propose HeRO, a hierarchical, semantics-aware diffusion-based policy. HeRO builds a global field and a set of local semantic fields from fused DINOv2 and Stable Diffusion features, and conditions the denoiser on them through a permutation-invariant network, which avoids order-sensitive bias and supports fine-grained, spatially consistent part perception. On the Place Dual Shoes task, HeRO improves the success rate by 12.3%, and it averages a 6.5% gain across six pose-aware manipulation tasks, which the authors report as a new state of the art.

📝 Abstract

Imitation learning for robotic manipulation has progressed from 2D image policies to 3D representations that explicitly encode geometry. Yet purely geometric policies often lack explicit part-level semantics, which are critical for pose-aware manipulation (e.g., distinguishing a shoe's toe from its heel). In this paper, we present HeRO, a diffusion-based policy that couples geometry and semantics via hierarchical semantic fields. HeRO employs dense semantics lifting to fuse discriminative, geometry-sensitive features from DINOv2 with the smooth, globally coherent correspondences from Stable Diffusion, yielding dense features that are both fine-grained and spatially consistent. These features are processed and partitioned to construct a global field and a set of local fields. A hierarchical conditioning module conditions the generative denoiser on the global and local fields using a permutation-invariant network architecture, thereby avoiding order-sensitive bias and producing a coherent control policy for pose-aware manipulation. In various tests, HeRO establishes a new state of the art, improving success on Place Dual Shoes by 12.3% and averaging 6.5% gains across six challenging pose-aware tasks. Code is available at https://github.com/Chongyang-99/HeRO.
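
The abstract's point about permutation-invariant conditioning can be illustrated with a small sketch. This is not the authors' architecture: it assumes a Deep Sets-style design in which every local field is passed through one shared encoder and the results are max-pooled, so the order of the local fields cannot affect the conditioning vector. The names `shared_encoder`, `pool_local_fields`, and `build_condition`, the feature sizes, and the random weights are all hypothetical.

```python
import random


def shared_encoder(feature, weights):
    """Apply the same linear map followed by ReLU to one local-field feature."""
    return [max(0.0, sum(w * x for w, x in zip(row, feature))) for row in weights]


def pool_local_fields(local_fields, weights):
    """Encode each local field with shared weights, then max-pool across fields.

    Max is symmetric in its arguments, so the result ignores field order.
    """
    encoded = [shared_encoder(f, weights) for f in local_fields]
    return [max(column) for column in zip(*encoded)]


def build_condition(global_field, local_fields, weights):
    """Concatenate the global feature with the pooled local summary."""
    return list(global_field) + pool_local_fields(local_fields, weights)


# Hypothetical toy data: one 2-D global feature and five 4-D local features.
rng = random.Random(0)
weights = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]  # maps 4 -> 3
global_field = [0.5, -0.2]
local_fields = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(5)]

cond = build_condition(global_field, local_fields, weights)

# Shuffling the local fields leaves the conditioning vector unchanged.
shuffled = local_fields[:]
rng.shuffle(shuffled)
cond_shuffled = build_condition(global_field, shuffled, weights)
```

A sequence model that consumes the local fields in a fixed order would generally produce different outputs for `local_fields` and `shuffled`; the symmetric pooling step is what removes that dependence in this sketch.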

Problem

Research questions and friction points this paper is trying to address.

pose-aware manipulation

semantic representation

imitation learning

3D geometry

part-level semantics

Innovation

Methods, ideas, or system contributions that make the work stand out.

hierarchical semantic fields

dense semantics lifting

diffusion-based policy
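
As a rough illustration of what "dense semantics lifting" could involve, the sketch below fuses two per-pixel feature maps and attaches the fused feature to 3D points by projecting them into the image. The page does not specify how HeRO does this, so everything here is an assumption: fusion by per-source L2 normalization and concatenation, a pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy`, nearest-pixel lookup, and the hypothetical function names `fuse_maps` and `lift_to_points`. The two input maps merely stand in for DINOv2 and Stable Diffusion features.

```python
import math


def l2_normalize(vec):
    """Scale a feature vector to unit length (zero vectors are left as-is)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def fuse_maps(map_a, map_b):
    """Per pixel, L2-normalize each source feature and concatenate them."""
    return [
        [l2_normalize(a) + l2_normalize(b) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(map_a, map_b)
    ]


def lift_to_points(points, fused, fx, fy, cx, cy):
    """Project camera-frame points with a pinhole model; take the nearest pixel feature.

    Points behind the camera or outside the image get None.
    """
    height, width = len(fused), len(fused[0])
    lifted = []
    for x, y, z in points:
        if z <= 0:
            lifted.append(None)
            continue
        u = int(round(fx * x / z + cx))  # pixel column
        v = int(round(fy * y / z + cy))  # pixel row
        inside = 0 <= u < width and 0 <= v < height
        lifted.append(fused[v][u] if inside else None)
    return lifted


# Hypothetical 2x2 feature maps: map_a has 2-D features, map_b has 1-D features.
map_a = [[[1.0, 0.0], [3.0, 4.0]], [[0.0, 2.0], [1.0, 1.0]]]
map_b = [[[5.0], [2.0]], [[1.0], [4.0]]]
fused = fuse_maps(map_a, map_b)

# One point that lands on pixel (row 0, column 1) and one behind the camera.
points = [(1.0, 0.0, 1.0), (0.0, 0.0, -1.0)]
lifted = lift_to_points(points, fused, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
```

Normalizing each source before concatenation keeps either feature type from dominating the fused vector by scale alone; a real pipeline would likely also upsample the feature maps and handle occlusion, which this sketch omits.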