🤖 AI Summary
Diffusion-based policies suffer from poor robustness, particularly in long-horizon, multi-stage robotic manipulation tasks where failure in one subtask impedes recovery, and from uninterpretable latent representations. To address these bottlenecks, we propose MoE-DP, a framework that embeds a Mixture-of-Experts (MoE) layer into diffusion policies, enabling observation-driven, dynamic skill decomposition and routing via a learnable gating mechanism; each expert specializes in a semantically distinct task phase. Crucially, the modular structure permits failure diagnosis and task-sequence reordering without retraining, substantially enhancing fault recovery. Integrated with a vision encoder and a diffusion policy, MoE-DP achieves a 36% average relative improvement in success rate across six simulated long-horizon tasks under disturbances and demonstrates effectiveness on real robotic hardware. This work presents the first diffusion-based visuomotor control framework that simultaneously delivers high robustness and an interpretable, compositional skill structure.
📝 Abstract
Diffusion policies have emerged as a powerful framework for robotic visuomotor control, yet they often lack the robustness to recover from subtask failures in long-horizon, multi-stage tasks, and their learned representations of observations are often difficult to interpret. In this work, we propose the Mixture of Experts-Enhanced Diffusion Policy (MoE-DP), whose core idea is to insert a Mixture of Experts (MoE) layer between the visual encoder and the diffusion model. This layer decomposes the policy's knowledge into a set of specialized experts, which are dynamically activated to handle different phases of a task. We demonstrate through extensive experiments that MoE-DP exhibits a strong capability to recover from disturbances, significantly outperforming standard baselines in robustness. On a suite of 6 long-horizon simulation tasks, this leads to a 36% average relative improvement in success rate under disturbed conditions. This enhanced robustness is further validated in the real world, where MoE-DP also shows significant performance gains. We further show that MoE-DP learns an interpretable skill decomposition, where distinct experts correspond to semantic task primitives (e.g., approaching, grasping). This learned structure can be leveraged for inference-time control, allowing for the rearrangement of subtasks without any re-training. Our video and code are available at https://moe-dp-website.github.io/MoE-DP-Website/.
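The core mechanism described above, a gating network that softly routes an encoded observation to a set of specialized experts, can be sketched as a toy NumPy example. This is a minimal illustration of a generic softmax-gated MoE layer, not the authors' implementation; all names, dimensions, and the use of linear experts are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    # numerically stable softmax over the last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class MoELayer:
    """Toy Mixture-of-Experts layer: a learnable gate scores each expert
    from the visual feature, and the output is the gate-weighted mixture
    of expert outputs (hypothetical stand-in for the layer between the
    visual encoder and the diffusion model)."""

    def __init__(self, dim, n_experts):
        self.W_gate = rng.standard_normal((dim, n_experts)) * 0.01
        # each expert is a single linear map here; in practice an MLP
        self.experts = [rng.standard_normal((dim, dim)) * 0.01
                        for _ in range(n_experts)]

    def __call__(self, obs_feat):
        gate = softmax(obs_feat @ self.W_gate)              # (B, n_experts)
        outs = np.stack([obs_feat @ E for E in self.experts], axis=1)
        mixed = np.einsum('be,bed->bd', gate, outs)         # (B, dim)
        return mixed, gate

layer = MoELayer(dim=16, n_experts=4)
x = rng.standard_normal((2, 16))        # batch of encoded observations
y, gate = layer(x)
print(y.shape)                          # mixture output, one per observation
print(gate.sum(axis=-1))                # gate weights sum to 1 per sample
```

Because the gate is observation-driven, different experts dominate in different task phases, which is what makes the decomposition inspectable at inference time: one can read off which expert is active, and (as the abstract notes) reorder subtasks without retraining.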