π€ AI Summary
Robotic multi-task learning faces challenges including highly multimodal action distributions, insufficient representational capacity of monolithic models, and poor adaptability. To address these, we propose a modular diffusion policy framework featuring a novel factorized-diffusion architecture that decomposes complex action distributions into composable, addable, and removable specialized diffusion submodules. This design enables disentangled modeling of behavioral submodalities and supports incremental fine-tuning of individual components. Critically, it achieves zero-forgetting task expansion and efficient sim-to-real transfer. Experiments demonstrate that our method consistently outperforms both strong modular and monolithic baselines in both simulated and real-robot manipulation tasks, yielding substantial improvements in cross-task generalization and adaptation efficiency.
π Abstract
Multitask learning poses significant challenges due to the highly multimodal and diverse nature of robot action distributions. However, effectively fitting policies to these complex task distributions is often difficult, and existing monolithic models often underfit the action distribution and lack the flexibility required for efficient adaptation. We introduce a novel modular diffusion policy framework that factorizes complex action distributions into a composition of specialized diffusion models, each capturing a distinct sub-mode of the behavior space for a more effective overall policy. In addition, this modular structure enables flexible policy adaptation to new tasks by adding or fine-tuning components, which inherently mitigates catastrophic forgetting. Empirically, across both simulation and real-world robotic manipulation settings, we illustrate how our method consistently outperforms strong modular and monolithic baselines.