HiMaCon: Discovering Hierarchical Manipulation Concepts from Unlabeled Multi-Modal Data

📅 2025-10-13
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Current robotic manipulation generalization is hindered by reliance on manual annotations and task-specific definitions. To address this, we propose a novel method for autonomously discovering hierarchical manipulation concepts from unlabeled multimodal temporal data—such as vision and proprioception—without semantic supervision. Our approach employs a dual-branch self-supervised framework that jointly models cross-modal correlations and enables multi-timescale prediction, thereby achieving interpretable disentanglement of manipulation primitives and multi-granularity abstraction. To our knowledge, this is the first method to learn transferable manipulation representations across environments and tasks without any semantic labels. Extensive evaluation in simulation and on real robotic platforms demonstrates that the learned concepts closely align with human-interpretable manipulation primitives and significantly improve downstream policy generalization and sample efficiency.

📝 Abstract
Effective generalization in robotic manipulation requires representations that capture invariant patterns of interaction across environments and tasks. We present a self-supervised framework for learning hierarchical manipulation concepts that encode these invariant patterns through cross-modal sensory correlations and multi-level temporal abstractions without requiring human annotation. Our approach combines a cross-modal correlation network that identifies persistent patterns across sensory modalities with a multi-horizon predictor that organizes representations hierarchically across temporal scales. Manipulation concepts learned through this dual structure enable policies to focus on transferable relational patterns while maintaining awareness of both immediate actions and longer-term goals. Empirical evaluation across simulated benchmarks and real-world deployments demonstrates significant performance improvements with our concept-enhanced policies. Analysis reveals that the learned concepts resemble human-interpretable manipulation primitives despite receiving no semantic supervision. This work both advances the understanding of representation learning for manipulation and provides a practical approach to enhancing robotic performance in complex scenarios.
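
The dual structure described in the abstract can be pictured with a minimal sketch. The code below is not the authors' implementation: it assumes PyTorch, arbitrary embedding sizes, an InfoNCE-style alignment term as one plausible instantiation of the cross-modal correlation branch, and simple per-horizon linear heads for the multi-horizon predictor. All class, layer, and dimension choices are hypothetical and only serve to make the two self-supervised training signals concrete.

```python
# Minimal sketch (not the authors' code) of a dual-branch self-supervised
# concept learner: per-modality encoders produce per-timestep embeddings,
# a cross-modal branch scores agreement between modalities, and a
# multi-horizon branch predicts future embeddings at several timescales.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConceptModel(nn.Module):
    def __init__(self, vision_dim=512, proprio_dim=32, concept_dim=128,
                 horizons=(1, 4, 16)):
        super().__init__()
        # Modality-specific encoders (hypothetical architectures).
        self.vision_enc = nn.Sequential(nn.Linear(vision_dim, 256), nn.ReLU(),
                                        nn.Linear(256, concept_dim))
        self.proprio_enc = nn.Sequential(nn.Linear(proprio_dim, 128), nn.ReLU(),
                                         nn.Linear(128, concept_dim))
        # One predictor head per temporal horizon.
        self.horizons = horizons
        self.predictors = nn.ModuleList(
            [nn.Linear(concept_dim, concept_dim) for _ in horizons])

    def forward(self, vision, proprio):
        # vision: (B, T, vision_dim), proprio: (B, T, proprio_dim)
        zv = self.vision_enc(vision)     # (B, T, concept_dim)
        zp = self.proprio_enc(proprio)   # (B, T, concept_dim)
        return zv, zp

    def cross_modal_loss(self, zv, zp):
        # InfoNCE-style alignment of the two modalities at matching timesteps,
        # one plausible reading of "cross-modal correlation".
        b, t, d = zv.shape
        zv = F.normalize(zv.reshape(b * t, d), dim=-1)
        zp = F.normalize(zp.reshape(b * t, d), dim=-1)
        logits = zv @ zp.t() / 0.07
        labels = torch.arange(b * t, device=logits.device)
        return F.cross_entropy(logits, labels)

    def multi_horizon_loss(self, zv, zp):
        # Predict the fused embedding k steps ahead, for several horizons k,
        # so different heads capture different temporal granularities.
        z = 0.5 * (zv + zp)              # simple fusion (assumption)
        loss = 0.0
        for k, head in zip(self.horizons, self.predictors):
            if z.shape[1] <= k:
                continue
            pred = head(z[:, :-k])       # predict z_{t+k} from z_t
            target = z[:, k:].detach()
            loss = loss + F.mse_loss(pred, target)
        return loss


# Usage with random stand-in data:
model = ConceptModel()
vision = torch.randn(8, 32, 512)    # 8 trajectories, 32 timesteps of vision features
proprio = torch.randn(8, 32, 32)    # matching proprioception stream
zv, zp = model(vision, proprio)
loss = model.cross_modal_loss(zv, zp) + model.multi_horizon_loss(zv, zp)
loss.backward()
```

The point of the sketch is only the shape of the objective: one term ties modalities together at each timestep, the other forces the same embedding space to remain predictive over short and long horizons, which is where the hierarchical, multi-granularity structure would come from.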
Problem

Research questions and friction points this paper is trying to address.

Learning hierarchical manipulation concepts from unlabeled multi-modal data
Identifying invariant interaction patterns through cross-modal sensory correlations
Enabling policies to focus on transferable relational patterns across tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised learning of hierarchical manipulation concepts
Cross-modal correlation network with multi-horizon predictor
Transferable relational patterns learned without human annotation (see the policy-conditioning sketch after this list)
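
As referenced above, the following hypothetical sketch shows one way the learned concepts could be plugged into a downstream policy: the concept encoder from the earlier sketch is frozen and its embedding is concatenated with the current observation before action prediction. The class name, dimensions, and fusion rule are assumptions for illustration, not the paper's architecture.

```python
# Minimal sketch (hypothetical) of a concept-conditioned downstream policy:
# the policy consumes the current observation together with the learned
# concept embedding rather than raw sensory features alone.
import torch
import torch.nn as nn


class ConceptConditionedPolicy(nn.Module):
    def __init__(self, concept_model, obs_dim=544, concept_dim=128, act_dim=7):
        super().__init__()
        self.concept_model = concept_model
        for p in self.concept_model.parameters():
            p.requires_grad = False      # keep the concept encoder frozen
        self.policy = nn.Sequential(
            nn.Linear(obs_dim + concept_dim, 256), nn.ReLU(),
            nn.Linear(256, act_dim))

    def forward(self, vision, proprio):
        # vision: (B, T, 512), proprio: (B, T, 32); act on the latest timestep.
        with torch.no_grad():
            zv, zp = self.concept_model(vision, proprio)
            concept = 0.5 * (zv[:, -1] + zp[:, -1])   # fused concept at time t
        obs = torch.cat([vision[:, -1], proprio[:, -1]], dim=-1)
        return self.policy(torch.cat([obs, concept], dim=-1))


policy = ConceptConditionedPolicy(model)   # `model` from the sketch above
actions = policy(vision, proprio)          # (8, 7) action predictions
```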
👥 Authors
Ruizhe Liu
HKU Musketeers Foundation Institute of Data Science, The University of Hong Kong
Pei Zhou
HKU Musketeers Foundation Institute of Data Science, The University of Hong Kong
Qian Luo
HKU Musketeers Foundation Institute of Data Science, The University of Hong Kong; Transcengram
Li Sun
HKU Musketeers Foundation Institute of Data Science, The University of Hong Kong
Jun Cen
DAMO Academy, Alibaba Group; Hupan Lab
Yibing Song
Deputy Chief Engineer, BYD Group
Multi-Modal AI
Yanchao Yang
Assistant Professor, HKU; Stanford University; UCLA
Embodied AI, Computer Vision, Machine Learning