Diffusion Transformer Policy

📅 2024-10-21
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing vision-language-action models are constrained by small, task-specific action heads, limiting their ability to model diverse, continuous action sequences across tasks and robotic platforms. This paper introduces an end-to-end multimodal diffusion Transformer framework for embodied intelligence that eliminates the conventional action head and instead directly denoises continuous action chunks, conditioned on monocular visual observations and language, while jointly modeling spatiotemporal dynamics across the vision, language, and action modalities. Its core contribution is applying large-scale diffusion Transformers to continuous-action policy learning, substantially improving cross-environment generalization. Experiments demonstrate consistent gains over OpenVLA and Octo across CALVIN (ABC→D), LIBERO, SimplerEnv, and a real-world Franka arm: pretraining improves the success sequence length on CALVIN by over 1.2, and the average number of tasks completed in a row (out of 5) reaches 3.6.

📝 Abstract
Recent large vision-language-action models pretrained on diverse robot datasets have demonstrated the potential to generalize to new environments with only a small amount of in-domain data. However, those approaches usually predict individual discretized or continuous actions with a small action head, which limits their ability to handle diverse action spaces. In contrast, we model the continuous action sequence with a large multi-modal diffusion transformer, dubbed Diffusion Transformer Policy, in which we directly denoise action chunks with a large transformer model rather than embedding actions through a small action head. By leveraging the scaling capability of transformers, the proposed approach can effectively model continuous end-effector actions across large, diverse robot datasets and achieve better generalization performance. Extensive experiments demonstrate the effectiveness and generalization of Diffusion Transformer Policy on ManiSkill2, LIBERO, CALVIN, and SimplerEnv, as well as on a real-world Franka arm, achieving consistently better performance than OpenVLA and Octo on the real-to-sim benchmark SimplerEnv, the real-world Franka arm, and LIBERO. Specifically, without bells and whistles, the proposed approach achieves state-of-the-art performance with only a single third-view camera stream on the CALVIN task ABC->D, raising the average number of tasks completed in a row (out of 5) to 3.6, and the pretraining stage improves the success sequence length on CALVIN by over 1.2. Project Page: https://zhihou7.github.io/dit_policy_vla/
Problem

Research questions and friction points this paper is trying to address.

Modeling diverse action spaces effectively
Enhancing generalization across robot datasets
Improving continuous action sequence prediction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large multi-modal diffusion transformer
Denoising action chunks directly
Scaling capability of transformers
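The denoising idea behind the bullets above can be sketched in a few lines: a DDPM-style forward process corrupts a chunk of continuous actions, and a denoiser (a large multi-modal transformer in the paper, omitted here) is trained to regress the injected noise. This is a minimal sketch under assumed settings: the chunk shape, step count, linear schedule, and the loss target are illustrative choices, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed diffusion settings (standard DDPM-style, not the paper's exact values).
T = 100                                 # number of diffusion steps
betas = np.linspace(1e-4, 0.02, T)      # linear noise schedule
alphas_bar = np.cumprod(1.0 - betas)    # cumulative signal-retention factors

# Hypothetical action chunk: 8 future end-effector actions, 7 dims each.
chunk_len, act_dim = 8, 7
a0 = rng.standard_normal((chunk_len, act_dim))  # toy "clean" actions

# Forward process: a_t = sqrt(abar_t) * a0 + sqrt(1 - abar_t) * eps
t = 50
eps = rng.standard_normal(a0.shape)
a_t = np.sqrt(alphas_bar[t]) * a0 + np.sqrt(1.0 - alphas_bar[t]) * eps

def mse_loss(eps_pred, eps_true):
    """Training objective: the denoiser, conditioned on (a_t, t, observation
    and language tokens), regresses the injected noise eps via MSE."""
    return float(np.mean((eps_pred - eps_true) ** 2))

# A perfect denoiser would drive this loss to zero for each (a_t, t) pair.
loss = mse_loss(np.zeros_like(eps), eps)  # loss of a trivial zero prediction
```

At inference the learned denoiser is applied iteratively from pure noise down to a clean action chunk; the key design choice the paper argues for is making the denoiser itself the large transformer, rather than attaching a small action head to a frozen backbone.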