Pixel Motion Diffusion is What We Need for Robot Control

πŸ“… 2025-09-26
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the limited generalization and interpretability in language-conditioned robotic manipulation, stemming from the decoupling of high-level intent and low-level actions. We propose DAWNβ€”the first end-to-end framework unifying diffusion models for both high-level policy planning and low-level action generation. Its core innovation is a structured pixel-motion representation serving as an interpretable intermediate abstraction, explicitly encoding instruction-driven visual motion priors. This design enables joint optimization, cross-task transfer, and robust Sim2Real deployment. On the CALVIN benchmark, DAWN achieves state-of-the-art performance; its generalization is further validated across diverse tasks in MetaWorld. Crucially, it attains stable physical-world control with only minimal fine-tuning on real-world data.

Technology Category

Application Category

πŸ“ Abstract
We present DAWN (Diffusion is All We Need for robot control), a unified diffusion-based framework for language-conditioned robotic manipulation that bridges high-level motion intent and low-level robot action via structured pixel motion representation. In DAWN, both the high-level and low-level controllers are modeled as diffusion processes, yielding a fully trainable, end-to-end system with interpretable intermediate motion abstractions. DAWN achieves state-of-the-art results on the challenging CALVIN benchmark, demonstrating strong multi-task performance, and further validates its effectiveness on MetaWorld. Despite the substantial domain gap between simulation and reality and limited real-world data, we demonstrate reliable real-world transfer with only minimal finetuning, illustrating the practical viability of diffusion-based motion abstractions for robotic control. Our results show the effectiveness of combining diffusion modeling with motion-centric representations as a strong baseline for scalable and robust robot learning. Project page: https://nero1342.github.io/DAWN/
Problem

Research questions and friction points this paper is trying to address.

Develops diffusion-based framework for robot control tasks
Bridges high-level motion intent with low-level robot actions
Achieves reliable real-world transfer with minimal finetuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified diffusion framework for language-conditioned robot control
Models controllers as diffusion processes with motion abstractions
Pixel motion representation bridges intent and robot actions
πŸ”Ž Similar Papers
No similar papers found.