🤖 AI Summary
Two key challenges hinder multimodal fusion in robotic manipulation: (1) dominant modalities (e.g., vision) suppressing critical sparse signals (e.g., touch), and (2) inflexible adaptation to novel or missing modalities. To address these, we propose DiffRouter—a diffusion-based adaptive multimodal fusion framework integrating learnable routing. Instead of feature concatenation, DiffRouter employs independent diffusion models per modality and a lightweight router that dynamically learns inter-modal consensus weights, enabling complementary integration and adaptive switching—especially for contact-rich tasks. Crucially, it supports zero-shot incremental modality addition without retraining. Evaluated on RLBench simulations and real-world setups, DiffRouter achieves significant performance gains over baselines in occluded grasping, in-hand rotation, and puzzle insertion. Perturbation analysis confirms its capability to dynamically recalibrate modality importance, demonstrating strong robustness and generalization across diverse sensing configurations.
📝 Abstract
Effectively integrating diverse sensory modalities is crucial for robotic manipulation. However, the typical approach of feature concatenation is often suboptimal: dominant modalities such as vision can overwhelm sparse but critical signals like touch in contact-rich tasks, and monolithic architectures cannot flexibly incorporate new or missing modalities without retraining. Our method, DiffRouter, factorizes the policy into a set of diffusion models, each specialized for a single representation (e.g., vision or touch), and employs a router network that learns consensus weights to adaptively combine their contributions, enabling incremental addition of new representations. We evaluate our approach on simulated manipulation tasks in RLBench, as well as real-world tasks such as occluded object picking, in-hand spoon reorientation, and puzzle insertion, where it significantly outperforms feature-concatenation baselines in scenarios requiring multimodal reasoning. Our policy further demonstrates robustness to physical perturbations and sensor corruption, and a perturbation-based importance analysis reveals adaptive shifts in modality importance.
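To make the fusion mechanism concrete, the sketch below shows one plausible reading of the router-weighted combination described above: each modality has its own noise-prediction (denoiser) network, and a lightweight router maps observation features to consensus weights that form a convex combination of the per-modality predictions. All function names, the stand-in denoisers, and the router's weight matrix are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over modality logits.
    e = np.exp(x - x.max())
    return e / e.sum()

# Stand-ins for trained per-modality diffusion denoisers (hypothetical):
# each predicts the noise on the action trajectory at timestep t.
def vision_denoiser(noisy_action, t):
    return noisy_action * 0.1  # placeholder for a trained network

def touch_denoiser(noisy_action, t):
    return noisy_action * 0.2  # placeholder for a trained network

def router(features):
    # Lightweight router: maps observation features to per-modality
    # consensus weights. W is a stand-in for learned parameters.
    W = np.full((features.size, 2), 0.01)
    return softmax(features @ W)

def fused_noise_prediction(noisy_action, t, features):
    # Stack per-modality noise predictions, then take a convex
    # combination using the router's consensus weights.
    eps = np.stack([vision_denoiser(noisy_action, t),
                    touch_denoiser(noisy_action, t)])
    w = router(features)                 # weights sum to 1
    return np.tensordot(w, eps, axes=1)  # weighted fusion
```

Because the weights are produced per observation, the fused prediction can shift toward touch during contact and back toward vision otherwise; a new modality would, under this reading, only require adding one denoiser and one router output.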