🤖 AI Summary
This study addresses the curse of dimensionality that modality redundancy introduces in multimodal robotic learning. We propose a unified cross-modality attention (CMA) mechanism that jointly enables dynamic modality selection (e.g., tactile, audio) and unsupervised skill segmentation. CMA employs dynamic modality gating to adaptively attend to the most informative sensory input at each action step while simultaneously disentangling reusable primitive skill units from expert demonstrations, enabling hierarchical imitation learning. To our knowledge, this is the first work to instantiate both modality selection and skill segmentation within a single CMA framework. The approach significantly improves generalization and sample efficiency on long-horizon, contact-rich manipulation tasks (e.g., assembly sequences of 10+ steps): modality selection accuracy increases by 32%, and skill segmentation achieves an F1 score of 0.89.
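To make the gating idea concrete, below is a minimal PyTorch sketch of what a gated cross-modality attention block could look like, assuming per-modality encoders already produce fixed-size token embeddings. The names (`CMABlock`, the gating head, the tensor layout) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class CMABlock(nn.Module):
    """Sketch: a query token attends over modality tokens, with a soft modality gate."""

    def __init__(self, d_model: int, n_modalities: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Gating head scores each modality from the current query state.
        self.gate = nn.Sequential(nn.Linear(d_model, n_modalities), nn.Softmax(dim=-1))

    def forward(self, query: torch.Tensor, modality_tokens: torch.Tensor):
        # query:           (B, 1, d_model)  current action/query token
        # modality_tokens: (B, M, d_model)  one token per modality (vision, tactile, audio, ...)
        gates = self.gate(query.squeeze(1))            # (B, M) soft modality weights
        gated = modality_tokens * gates.unsqueeze(-1)  # down-weight uninformative modalities
        fused, _ = self.attn(query, gated, gated)      # (B, 1, d_model) fused feature
        return fused, gates                            # gates expose the modality selection


block = CMABlock(d_model=128, n_modalities=3)
q = torch.randn(8, 1, 128)       # batch of 8 query tokens
mods = torch.randn(8, 3, 128)    # e.g., vision / tactile / audio embeddings
fused, gates = block(q, mods)    # gates sum to 1 over the 3 modalities
```

In a setup like this, the `gates` tensor can be logged per timestep to inspect which modality the policy relied on at each action step.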
📝 Abstract
Incorporating additional sensory modalities such as tactile and audio into foundational robotic models poses significant challenges due to the curse of dimensionality. This work addresses this issue through modality selection. We propose a cross-modality attention (CMA) mechanism to identify and selectively utilize the modalities that are most informative for action generation at each timestep. Furthermore, we extend the application of CMA to segment primitive skills from expert demonstrations and leverage this segmentation to train a hierarchical policy capable of solving long-horizon, contact-rich manipulation tasks.
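The abstract also mentions training a hierarchical policy on top of the segmented primitive skills. As a rough illustration only, the sketch below pairs a high-level skill selector with per-skill low-level policy heads; the class and argument names (`HierarchicalPolicy`, `n_skills`, etc.) are hypothetical and not taken from the paper.

```python
import torch
import torch.nn as nn


class HierarchicalPolicy(nn.Module):
    """Sketch: high-level skill selection over low-level skill policies."""

    def __init__(self, obs_dim: int, action_dim: int, n_skills: int, hidden: int = 256):
        super().__init__()
        # High-level selector: picks which primitive skill to execute.
        self.skill_selector = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_skills),
        )
        # One low-level policy head per segmented primitive skill.
        self.skill_policies = nn.ModuleList([
            nn.Sequential(
                nn.Linear(obs_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, action_dim),
            )
            for _ in range(n_skills)
        ])

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        skill_logits = self.skill_selector(obs)               # (B, n_skills)
        skill_id = skill_logits.argmax(dim=-1)                # hard selection at inference
        actions = torch.stack(
            [policy(obs) for policy in self.skill_policies], dim=1
        )                                                      # (B, n_skills, action_dim)
        return actions[torch.arange(obs.shape[0]), skill_id]  # action of the selected skill
```

Under this kind of factorization, the low-level heads would be trained on the frames assigned to their skill segment, while the selector learns to reproduce the segment boundaries over a long-horizon demonstration.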