Self-Supervised Multisensory Pretraining for Contact-Rich Robot Reinforcement Learning

📅 2025-11-18
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
For contact-rich robotic manipulation tasks, challenges persist in synergistic multimodal (vision, force, proprioception) perception, noise robustness, and low online sample efficiency. This paper proposes MSDP, a self-supervised Multisensory Dynamic Pre-training framework. MSDP employs a masked cross-modal autoencoder with frozen perceptual representations and an asymmetric cross-modal attention architecture to decouple dynamic task-feature extraction (critic) from stable policy execution (actor), enabling efficient multimodal fusion. Without human annotations, MSDP achieves high success rates after only 6,000 online interactions in both simulation and real-robot settings, while maintaining strong robustness under sensor noise and varying object dynamics. Key contributions include: (1) the first multisensory dynamic pre-training paradigm for robotic manipulation; (2) an asymmetric cross-modal attention mechanism that disentangles perception and control; and (3) a lightweight, efficient multimodal representation learning framework specifically designed for contact-rich manipulation.

๐Ÿ“ Abstract
Effective contact-rich manipulation requires robots to synergistically leverage vision, force, and proprioception. However, Reinforcement Learning agents struggle to learn in such multisensory settings, especially amidst sensory noise and dynamic changes. We propose MultiSensory Dynamic Pretraining (MSDP), a novel framework for learning expressive multisensory representations tailored for task-oriented policy learning. MSDP is based on masked autoencoding and trains a transformer-based encoder by reconstructing multisensory observations from only a subset of sensor embeddings, leading to cross-modal prediction and sensor fusion. For downstream policy learning, we introduce a novel asymmetric architecture, where a cross-attention mechanism allows the critic to extract dynamic, task-specific features from the frozen embeddings, while the actor receives a stable pooled representation to guide its actions. Our method demonstrates accelerated learning and robust performance under diverse perturbations, including sensor noise and changes in object dynamics. Evaluations in multiple challenging, contact-rich robot manipulation tasks in simulation and the real world showcase the effectiveness of MSDP. Our approach exhibits strong robustness to perturbations and achieves high success rates on the real robot with as few as 6,000 online interactions, offering a simple yet powerful solution for complex multisensory robotic control.
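The masked autoencoding objective in the abstract can be sketched minimally: mask one modality's embedding, predict all embeddings from the visible ones, and score reconstruction only on the masked part. This is an illustrative toy (the paper uses a transformer encoder; here a fixed random linear map, hypothetical 4-dim embeddings, and the `encoder` name are placeholders), not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-modality embeddings (hypothetical 4-dim each): vision, force, proprioception.
tokens = {
    "vision":  rng.normal(size=(4,)),  # camera embedding
    "force":   rng.normal(size=(4,)),  # force/torque sensor embedding
    "proprio": rng.normal(size=(4,)),  # joint-state embedding
}

def masked_reconstruction_loss(tokens, masked, encoder):
    """Reconstruct ALL modality embeddings from the unmasked subset;
    the MSE is measured only on the masked modalities."""
    visible = np.concatenate([v for k, v in tokens.items() if k not in masked])
    pred = encoder(visible)                      # predicts full concatenation
    target = np.concatenate(list(tokens.values()))
    mask = np.concatenate(
        [np.full(v.shape, k in masked) for k, v in tokens.items()]
    )
    return float(np.mean((pred[mask] - target[mask]) ** 2))

# Stand-in "encoder": a fixed random linear map (a transformer in the paper).
W = rng.normal(size=(8, 12)) * 0.1
encoder = lambda x: x @ W

loss = masked_reconstruction_loss(tokens, masked={"force"}, encoder=encoder)
print(loss >= 0.0)
```

In the actual framework this loss would train the transformer's weights, forcing it to infer a masked sensor stream (e.g. force) from the remaining modalities, which is what yields cross-modal fusion.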
Problem

Research questions and friction points this paper is trying to address.

Learning multisensory robot manipulation under sensory noise and dynamics changes
Developing pretrained representations for vision, force, and proprioception fusion
Accelerating policy learning for contact-rich tasks with limited real-world interactions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised pretraining with masked autoencoding for multisensory data
Asymmetric actor-critic architecture with cross-attention mechanism
Transformer encoder reconstructing observations from partial sensor embeddings
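The asymmetric readout above can be sketched in a few lines: the critic applies cross-attention over the frozen per-modality tokens with its own learned query, while the actor receives a stable mean-pooled representation. This is a single-head, identity-projection toy under assumed shapes (3 modality tokens, 8-dim embeddings, random query initialization), not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8                                 # embedding dimension (assumed)
frozen = rng.normal(size=(3, d))      # frozen tokens: vision, force, proprio

def cross_attention(query, keys_values):
    """Single-head scaled dot-product cross-attention (identity projections
    for brevity); the critic's learned query weights task-relevant tokens."""
    scores = keys_values @ query / np.sqrt(d)   # one score per token, (3,)
    w = np.exp(scores - scores.max())
    w /= w.sum()                                # softmax attention weights
    return w @ keys_values                      # weighted mix of tokens, (d,)

critic_query = rng.normal(size=(d,))  # trained jointly with the critic (toy init)
critic_feat = cross_attention(critic_query, frozen)  # dynamic, task-specific
actor_feat = frozen.mean(axis=0)                     # stable pooled representation
```

The asymmetry is the point: the critic's query can shift attention across sensors as the task phase changes (e.g. toward force during contact), while the actor conditions on a smoother pooled feature for stable execution.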