Multi-Modal Manipulation via Multi-Modal Policy Consensus

📅 2025-09-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Two key challenges hinder multimodal fusion in robotic manipulation: (1) dominant modalities (e.g., vision) suppressing critical sparse signals (e.g., touch), and (2) inflexible adaptation to novel or missing modalities. To address these, we propose DiffRouter—a diffusion-based adaptive multimodal fusion framework integrating learnable routing. Instead of feature concatenation, DiffRouter employs independent diffusion models per modality and a lightweight router that dynamically learns inter-modal consensus weights, enabling complementary integration and adaptive switching—especially for contact-rich tasks. Crucially, it supports zero-shot incremental modality addition without retraining. Evaluated on RLBench simulations and real-world setups, DiffRouter achieves significant performance gains over baselines in occluded grasping, in-hand rotation, and puzzle insertion. Perturbation analysis confirms its capability to dynamically recalibrate modality importance, demonstrating strong robustness and generalization across diverse sensing configurations.

📝 Abstract
Effectively integrating diverse sensory modalities is crucial for robotic manipulation. However, the typical approach of feature concatenation is often suboptimal: dominant modalities such as vision can overwhelm sparse but critical signals like touch in contact-rich tasks, and monolithic architectures cannot flexibly incorporate new or missing modalities without retraining. Our method factorizes the policy into a set of diffusion models, each specialized for a single representation (e.g., vision or touch), and employs a router network that learns consensus weights to adaptively combine their contributions, enabling incremental addition of new representations. We evaluate our approach on simulated manipulation tasks in RLBench, as well as real-world tasks such as occluded object picking, in-hand spoon reorientation, and puzzle insertion, where it significantly outperforms feature-concatenation baselines on scenarios requiring multimodal reasoning. Our policy further demonstrates robustness to physical perturbations and sensor corruption. We also conduct a perturbation-based importance analysis, which reveals adaptive shifts in importance between modalities.
Problem

Research questions and friction points this paper is trying to address.

Integrating diverse sensory modalities for robotic manipulation tasks
Overcoming vision dominance over sparse critical signals like touch
Enabling flexible incorporation of new or missing modalities without retraining
Innovation

Methods, ideas, or system contributions that make the work stand out.

Factorizes policy into specialized diffusion models per modality
Uses router network to adaptively combine modality contributions
Enables incremental addition of new representations without retraining
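The core fusion idea described above can be sketched in a few lines: each modality's diffusion policy produces its own denoising (noise) prediction, and a router assigns softmax consensus weights that blend these predictions into a single fused output. The sketch below is a simplified, hypothetical illustration of that weighting scheme (the function names, dict-based interface, and scalar router logits are assumptions; the actual paper uses learned networks over modality features):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of router logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def fuse_denoiser_outputs(noise_preds, router_logits):
    """Blend per-modality denoiser outputs via consensus weights.

    noise_preds:   dict mapping modality name -> noise-prediction vector
                   (output of that modality's diffusion model)
    router_logits: dict mapping modality name -> router score
                   (in the paper this comes from a learned router network;
                   scalars are used here purely for illustration)
    Returns the fused prediction and the per-modality weights.
    """
    mods = list(noise_preds)
    weights = softmax([router_logits[m] for m in mods])
    dim = len(next(iter(noise_preds.values())))
    fused = [0.0] * dim
    for w, m in zip(weights, mods):
        for i, v in enumerate(noise_preds[m]):
            fused[i] += w * v  # consensus-weighted sum over modalities
    return fused, dict(zip(mods, weights))
```

Because each modality contributes through its own denoiser plus one router weight, adding a new modality only requires training one new diffusion model and extending the router's output, which is consistent with the incremental-addition claim above.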