AI Summary
This work addresses the limited generalization and interpretability of language-conditioned robotic manipulation, which stem from the decoupling of high-level intent and low-level actions. We propose DAWN, the first end-to-end framework unifying diffusion models for both high-level policy planning and low-level action generation. Its core innovation is a structured pixel-motion representation serving as an interpretable intermediate abstraction that explicitly encodes instruction-driven visual motion priors. This design enables joint optimization, cross-task transfer, and robust Sim2Real deployment. On the CALVIN benchmark, DAWN achieves state-of-the-art performance; its generalization is further validated across diverse tasks in MetaWorld. Crucially, it attains stable physical-world control with only minimal fine-tuning on real-world data.
Abstract
We present DAWN (Diffusion is All We Need for robot control), a unified diffusion-based framework for language-conditioned robotic manipulation that bridges high-level motion intent and low-level robot action via a structured pixel-motion representation. In DAWN, both the high-level and low-level controllers are modeled as diffusion processes, yielding a fully trainable, end-to-end system with interpretable intermediate motion abstractions. DAWN achieves state-of-the-art results on the challenging CALVIN benchmark, demonstrating strong multi-task performance, and further validates its effectiveness on MetaWorld. Despite the substantial domain gap between simulation and reality, and with only limited real-world data, we demonstrate reliable real-world transfer after minimal fine-tuning, illustrating the practical viability of diffusion-based motion abstractions for robotic control. Our results show the effectiveness of combining diffusion modeling with motion-centric representations as a strong baseline for scalable and robust robot learning. Project page: https://nero1342.github.io/DAWN/
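The two-stage structure described above (a high-level diffusion process mapping a language instruction to a pixel-motion representation, then a low-level diffusion process mapping that representation to robot actions) can be sketched as a toy pipeline. This is a minimal illustrative sketch, not the paper's implementation: the `denoise` function, the conditioning scheme, and all shapes and names below are assumptions, with a closed-form placeholder standing in for a learned denoiser network.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise(x, cond, steps=10):
    """Toy reverse-diffusion loop (placeholder for a learned denoiser):
    repeatedly nudge the noisy sample x toward a conditioning-dependent
    target, mimicking iterative denoising."""
    target = np.tanh(cond.mean()) * np.ones_like(x)  # stand-in "clean" signal
    for _ in range(steps):
        x = x + 0.3 * (target - x)                   # one denoising step
    return x

# Hypothetical two-stage pipeline mirroring the high-/low-level split:
instruction = rng.normal(size=16)                    # language embedding (assumed dim)
noisy_motion = rng.normal(size=(8, 8))               # noisy pixel-motion canvas
pixel_motion = denoise(noisy_motion, instruction)    # high-level: intent -> pixel motion

noisy_action = rng.normal(size=7)                    # e.g. a 7-DoF action vector
action = denoise(noisy_action, pixel_motion)         # low-level: pixel motion -> action
```

Because both stages share the same denoising formulation, gradients (in a learned version) could flow through the intermediate pixel-motion representation, which is what makes the system end-to-end trainable while keeping that representation inspectable.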