HiMaCon: Discovering Hierarchical Manipulation Concepts from Unlabeled Multi-Modal Data

📅 2025-10-13
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Current robotic manipulation generalization is hindered by reliance on manual annotations and task-specific definitions. To address this, we propose a novel method for autonomously discovering hierarchical manipulation concepts from unlabeled multimodal temporal data—such as vision and proprioception—without semantic supervision. Our approach employs a dual-branch self-supervised framework that jointly models cross-modal correlations and enables multi-timescale prediction, thereby achieving interpretable disentanglement of manipulation primitives and multi-granularity abstraction. To our knowledge, this is the first method to learn transferable manipulation representations across environments and tasks without any semantic labels. Extensive evaluation in simulation and on real robotic platforms demonstrates that the learned concepts closely align with human-interpretable manipulation primitives and significantly improve downstream policy generalization and sample efficiency.

📝 Abstract
Effective generalization in robotic manipulation requires representations that capture invariant patterns of interaction across environments and tasks. We present a self-supervised framework for learning hierarchical manipulation concepts that encode these invariant patterns through cross-modal sensory correlations and multi-level temporal abstractions without requiring human annotation. Our approach combines a cross-modal correlation network that identifies persistent patterns across sensory modalities with a multi-horizon predictor that organizes representations hierarchically across temporal scales. Manipulation concepts learned through this dual structure enable policies to focus on transferable relational patterns while maintaining awareness of both immediate actions and longer-term goals. Empirical evaluation across simulated benchmarks and real-world deployments demonstrates significant performance improvements with our concept-enhanced policies. Analysis reveals that the learned concepts resemble human-interpretable manipulation primitives despite receiving no semantic supervision. This work both advances the understanding of representation learning for manipulation and provides a practical approach to enhancing robotic performance in complex scenarios.
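
The dual structure described in the abstract can be pictured with a minimal sketch. The code below is not the authors' implementation: it assumes PyTorch, arbitrary embedding sizes, an InfoNCE-style alignment term as one plausible instantiation of the cross-modal correlation branch, and simple per-horizon linear heads for the multi-horizon predictor. All class, layer, and dimension choices are hypothetical and only serve to make the two self-supervised training signals concrete.

```python
# Minimal sketch (not the authors' code) of a dual-branch self-supervised
# concept learner: per-modality encoders produce per-timestep embeddings,
# a cross-modal branch scores agreement between modalities, and a
# multi-horizon branch predicts future embeddings at several timescales.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConceptModel(nn.Module):
    def __init__(self, vision_dim=512, proprio_dim=32, concept_dim=128,
                 horizons=(1, 4, 16)):
        super().__init__()
        # Modality-specific encoders (hypothetical architectures).
        self.vision_enc = nn.Sequential(nn.Linear(vision_dim, 256), nn.ReLU(),
                                        nn.Linear(256, concept_dim))
        self.proprio_enc = nn.Sequential(nn.Linear(proprio_dim, 128), nn.ReLU(),
                                         nn.Linear(128, concept_dim))
        # One predictor head per temporal horizon.
        self.horizons = horizons
        self.predictors = nn.ModuleList(
            [nn.Linear(concept_dim, concept_dim) for _ in horizons])

    def forward(self, vision, proprio):
        # vision: (B, T, vision_dim), proprio: (B, T, proprio_dim)
        zv = self.vision_enc(vision)     # (B, T, concept_dim)
        zp = self.proprio_enc(proprio)   # (B, T, concept_dim)
        return zv, zp

    def cross_modal_loss(self, zv, zp):
        # InfoNCE-style alignment of the two modalities at matching timesteps,
        # one plausible reading of "cross-modal correlation".
        b, t, d = zv.shape
        zv = F.normalize(zv.reshape(b * t, d), dim=-1)
        zp = F.normalize(zp.reshape(b * t, d), dim=-1)
        logits = zv @ zp.t() / 0.07
        labels = torch.arange(b * t, device=logits.device)
        return F.cross_entropy(logits, labels)

    def multi_horizon_loss(self, zv, zp):
        # Predict the fused embedding k steps ahead, for several horizons k,
        # so different heads capture different temporal granularities.
        z = 0.5 * (zv + zp)              # simple fusion (assumption)
        loss = 0.0
        for k, head in zip(self.horizons, self.predictors):
            if z.shape[1] <= k:
                continue
            pred = head(z[:, :-k])       # predict z_{t+k} from z_t
            target = z[:, k:].detach()
            loss = loss + F.mse_loss(pred, target)
        return loss


# Usage with random stand-in data:
model = ConceptModel()
vision = torch.randn(8, 32, 512)    # 8 trajectories, 32 timesteps of vision features
proprio = torch.randn(8, 32, 32)    # matching proprioception stream
zv, zp = model(vision, proprio)
loss = model.cross_modal_loss(zv, zp) + model.multi_horizon_loss(zv, zp)
loss.backward()
```

The point of the sketch is only the shape of the objective: one term ties modalities together at each timestep, the other forces the same embedding space to remain predictive over short and long horizons, which is where the hierarchical, multi-granularity structure would come from.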
Problem

Research questions and friction points this paper is trying to address.

Learning hierarchical manipulation concepts from unlabeled multi-modal data
Identifying invariant interaction patterns through cross-modal sensory correlations
Enabling policies to focus on transferable relational patterns across tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised learning of hierarchical manipulation concepts
Cross-modal correlation network with multi-horizon predictor
Transferable relational patterns learned without human annotation (see the policy-conditioning sketch after this list)
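
As referenced above, the following hypothetical sketch shows one way the learned concepts could be plugged into a downstream policy: the concept encoder from the earlier sketch is frozen and its embedding is concatenated with the current observation before action prediction. The class name, dimensions, and fusion rule are assumptions for illustration, not the paper's architecture.

```python
# Minimal sketch (hypothetical) of a concept-conditioned downstream policy:
# the policy consumes the current observation together with the learned
# concept embedding rather than raw sensory features alone.
import torch
import torch.nn as nn


class ConceptConditionedPolicy(nn.Module):
    def __init__(self, concept_model, obs_dim=544, concept_dim=128, act_dim=7):
        super().__init__()
        self.concept_model = concept_model
        for p in self.concept_model.parameters():
            p.requires_grad = False      # keep the concept encoder frozen
        self.policy = nn.Sequential(
            nn.Linear(obs_dim + concept_dim, 256), nn.ReLU(),
            nn.Linear(256, act_dim))

    def forward(self, vision, proprio):
        # vision: (B, T, 512), proprio: (B, T, 32); act on the latest timestep.
        with torch.no_grad():
            zv, zp = self.concept_model(vision, proprio)
            concept = 0.5 * (zv[:, -1] + zp[:, -1])   # fused concept at time t
        obs = torch.cat([vision[:, -1], proprio[:, -1]], dim=-1)
        return self.policy(torch.cat([obs, concept], dim=-1))


policy = ConceptConditionedPolicy(model)   # `model` from the sketch above
actions = policy(vision, proprio)          # (8, 7) action predictions
```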
👥 Authors
Ruizhe Liu
HKU Musketeers Foundation Institute of Data Science, The University of Hong Kong
Pei Zhou
HKU Musketeers Foundation Institute of Data Science, The University of Hong Kong
Qian Luo
HKU Musketeers Foundation Institute of Data Science, The University of Hong Kong; Transcengram
Li Sun
HKU Musketeers Foundation Institute of Data Science, The University of Hong Kong
Jun Cen
DAMO Academy, Alibaba Group; Hupan Lab
Yibing Song
Deputy Chief Engineer, BYD Group
Multi-Modal AI
Yanchao Yang
Assistant Professor, HKU; Stanford University; UCLA
Embodied AI, Computer Vision, Machine Learning