AI Summary
This work addresses the limited generalization and interpretability of language-conditioned robotic manipulation, which stem from the decoupling of high-level intent and low-level actions. We propose DAWN, the first end-to-end framework unifying diffusion models for both high-level policy planning and low-level action generation. Its core innovation is a structured pixel-motion representation serving as an interpretable intermediate abstraction that explicitly encodes instruction-driven visual motion priors. This design enables joint optimization, cross-task transfer, and robust Sim2Real deployment. On the CALVIN benchmark, DAWN achieves state-of-the-art performance; its generalization is further validated across diverse tasks in MetaWorld. Crucially, it attains stable physical-world control with only minimal fine-tuning on real-world data.
Abstract
We present DAWN (Diffusion is All We Need for robot control), a unified diffusion-based framework for language-conditioned robotic manipulation that bridges high-level motion intent and low-level robot action via a structured pixel-motion representation. In DAWN, both the high-level and low-level controllers are modeled as diffusion processes, yielding a fully trainable, end-to-end system with interpretable intermediate motion abstractions. DAWN achieves state-of-the-art results on the challenging CALVIN benchmark, demonstrating strong multi-task performance, and further validates its effectiveness on MetaWorld. Despite the substantial domain gap between simulation and reality, and with only limited real-world data, we demonstrate reliable real-world transfer after minimal fine-tuning, illustrating the practical viability of diffusion-based motion abstractions for robotic control. Our results show the effectiveness of combining diffusion modeling with motion-centric representations as a strong baseline for scalable and robust robot learning. Project page: https://nero1342.github.io/DAWN/
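The two-stage structure described above (a high-level diffusion process mapping a language instruction to a pixel-motion representation, then a low-level diffusion process mapping that representation to robot actions) can be sketched as a toy pipeline. This is a minimal illustrative sketch, not the paper's implementation: the `denoise` function, the conditioning scheme, and all shapes and names below are assumptions, with a closed-form placeholder standing in for a learned denoiser network.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise(x, cond, steps=10):
    """Toy reverse-diffusion loop (placeholder for a learned denoiser):
    repeatedly nudge the noisy sample x toward a conditioning-dependent
    target, mimicking iterative denoising."""
    target = np.tanh(cond.mean()) * np.ones_like(x)  # stand-in "clean" signal
    for _ in range(steps):
        x = x + 0.3 * (target - x)                   # one denoising step
    return x

# Hypothetical two-stage pipeline mirroring the high-/low-level split:
instruction = rng.normal(size=16)                    # language embedding (assumed dim)
noisy_motion = rng.normal(size=(8, 8))               # noisy pixel-motion canvas
pixel_motion = denoise(noisy_motion, instruction)    # high-level: intent -> pixel motion

noisy_action = rng.normal(size=7)                    # e.g. a 7-DoF action vector
action = denoise(noisy_action, pixel_motion)         # low-level: pixel motion -> action
```

Because both stages share the same denoising formulation, gradients (in a learned version) could flow through the intermediate pixel-motion representation, which is what makes the system end-to-end trainable while keeping that representation inspectable.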