🤖 AI Summary
Two key challenges hinder multimodal fusion in robotic manipulation: (1) dominant modalities (e.g., vision) suppressing critical sparse signals (e.g., touch), and (2) inflexible adaptation to novel or missing modalities. To address these, we propose DiffRouter—a diffusion-based adaptive multimodal fusion framework integrating learnable routing. Instead of feature concatenation, DiffRouter employs independent diffusion models per modality and a lightweight router that dynamically learns inter-modal consensus weights, enabling complementary integration and adaptive switching—especially for contact-rich tasks. Crucially, it supports zero-shot incremental modality addition without retraining. Evaluated on RLBench simulations and real-world setups, DiffRouter achieves significant performance gains over baselines in occluded grasping, in-hand rotation, and puzzle insertion. Perturbation analysis confirms its capability to dynamically recalibrate modality importance, demonstrating strong robustness and generalization across diverse sensing configurations.
📝 Abstract
Effectively integrating diverse sensory modalities is crucial for robotic manipulation. However, the typical approach of feature concatenation is often suboptimal: dominant modalities such as vision can overwhelm sparse but critical signals like touch in contact-rich tasks, and monolithic architectures cannot flexibly incorporate new or missing modalities without retraining. Our method, DiffRouter, factorizes the policy into a set of diffusion models, each specialized for a single representation (e.g., vision or touch), and employs a router network that learns consensus weights to adaptively combine their contributions, enabling incremental addition of new representations. We evaluate our approach on simulated manipulation tasks in RLBench, as well as real-world tasks such as occluded object picking, in-hand spoon reorientation, and puzzle insertion, where it significantly outperforms feature-concatenation baselines in scenarios requiring multimodal reasoning. Our policy further demonstrates robustness to physical perturbations and sensor corruption, and a perturbation-based importance analysis reveals adaptive shifts in modality importance.
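To make the fusion mechanism concrete, the sketch below shows one plausible reading of the router-weighted combination described above: each modality has its own noise-prediction (denoiser) network, and a lightweight router maps observation features to consensus weights that form a convex combination of the per-modality predictions. All function names, the stand-in denoisers, and the router's weight matrix are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over modality logits.
    e = np.exp(x - x.max())
    return e / e.sum()

# Stand-ins for trained per-modality diffusion denoisers (hypothetical):
# each predicts the noise on the action trajectory at timestep t.
def vision_denoiser(noisy_action, t):
    return noisy_action * 0.1  # placeholder for a trained network

def touch_denoiser(noisy_action, t):
    return noisy_action * 0.2  # placeholder for a trained network

def router(features):
    # Lightweight router: maps observation features to per-modality
    # consensus weights. W is a stand-in for learned parameters.
    W = np.full((features.size, 2), 0.01)
    return softmax(features @ W)

def fused_noise_prediction(noisy_action, t, features):
    # Stack per-modality noise predictions, then take a convex
    # combination using the router's consensus weights.
    eps = np.stack([vision_denoiser(noisy_action, t),
                    touch_denoiser(noisy_action, t)])
    w = router(features)                 # weights sum to 1
    return np.tensordot(w, eps, axes=1)  # weighted fusion
```

Because the weights are produced per observation, the fused prediction can shift toward touch during contact and back toward vision otherwise; a new modality would, under this reading, only require adding one denoiser and one router output.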