Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model

πŸ“… 2025-10-31
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This paper addresses modality heterogeneity and the tight temporal coupling between visual state prediction and action generation in robot policy learning. The authors propose DUST, a dual-stream diffusion framework. Its core contributions are: (1) a decoupled vision–action dual-stream architecture with cross-modal representation sharing; (2) independent noise perturbations and a decoupled flow-matching loss that enable asynchronous evolution and joint sampling of states and actions; and (3) end-to-end policy learning via a multimodal diffusion Transformer integrated with a world model. On the RoboCasa and GR-1 benchmarks, DUST achieves up to 6% gains over strong baselines; test-time scaling yields an additional 2–5%, and success rates on real-robot tasks with the Franka Research 3 improve by 13%. Moreover, the pre-trained model demonstrates strong cross-task transferability.

πŸ“ Abstract
Recently, augmenting Vision-Language-Action models (VLAs) with world modeling has shown promise in improving robotic policy learning. However, it remains challenging to jointly predict next-state observations and action sequences because of the inherent difference between the two modalities. To address this, we propose DUal-STream diffusion (DUST), a world-model augmented VLA framework that handles the modality conflict and enhances the performance of VLAs across diverse tasks. Specifically, we propose a multimodal diffusion transformer architecture that explicitly maintains separate modality streams while still enabling cross-modal knowledge sharing. In addition, we introduce independent noise perturbations for each modality and a decoupled flow-matching loss. This design enables the model to learn the joint distribution in a bidirectional manner while avoiding the need for a unified latent space. Based on the decoupling of modalities during training, we also introduce a joint sampling method that supports test-time scaling, where action and vision tokens evolve asynchronously at different rates. Through experiments on simulated benchmarks such as RoboCasa and GR-1, DUST achieves up to 6% gains over baseline methods, while our test-time scaling approach provides an additional 2-5% boost. On real-world tasks with the Franka Research 3, DUST improves success rates by 13%, confirming its effectiveness beyond simulation. Furthermore, pre-training on action-free videos from BridgeV2 yields significant transfer gains on RoboCasa, underscoring DUST's potential for large-scale VLA pretraining.
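The abstract's central training idea, independent noise perturbations per modality with a decoupled flow-matching loss, can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, shapes, and the dummy `pred_fn` interface are hypothetical, and a simple linear interpolation path with a velocity-regression target is assumed, as in standard flow matching.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_targets(x0, t, noise):
    """Linear path x_t = (1 - t) * noise + t * x0;
    the regression target is the constant velocity x0 - noise."""
    x_t = (1.0 - t) * noise + t * x0
    v_target = x0 - noise
    return x_t, v_target

def decoupled_fm_loss(pred_fn, vision, action):
    """Decoupled flow-matching loss with independent timesteps and
    noise for each modality stream (hypothetical interface)."""
    t_v = rng.uniform(size=(vision.shape[0], 1))   # vision timestep
    t_a = rng.uniform(size=(action.shape[0], 1))   # action timestep
    z_v = rng.standard_normal(vision.shape)        # vision noise
    z_a = rng.standard_normal(action.shape)        # action noise
    xv_t, vv = flow_matching_targets(vision, t_v, z_v)
    xa_t, va = flow_matching_targets(action, t_a, z_a)
    # Dual-stream model sees both noisy streams and both timesteps.
    pv, pa = pred_fn(xv_t, t_v, xa_t, t_a)
    # Each stream gets its own flow-matching term; no unified latent.
    loss_v = np.mean((pv - vv) ** 2)
    loss_a = np.mean((pa - va) ** 2)
    return loss_v + loss_a
```

Because each stream carries its own timestep, the model can be queried with the vision stream at one noise level and the action stream at another, which is what makes the asynchronous joint sampling described next possible.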
Problem

Research questions and friction points this paper is trying to address.

Jointly predicting next-state observations and robotic action sequences
Handling modality conflict between vision and action in VLAs
Enhancing robotic policy learning through world-model augmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-stream diffusion handles modality conflict
Independent noise and loss decouple modalities
Joint sampling with asynchronous per-modality rates enables test-time scaling
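The joint-sampling idea above, where vision and action tokens evolve at different rates while conditioning on each other, can be sketched as a simple Euler sampler over two time grids. This is a hedged illustration under assumed conventions (linear flow from t=0 noise to t=1 data); the `velocity_fn` interface and step counts are hypothetical, not the paper's sampler.

```python
import numpy as np

def joint_sample(velocity_fn, shape_v, shape_a, steps_v=4, steps_a=8, seed=0):
    """Euler integration of both streams from t=0 (noise) to t=1 (data).
    The streams use different step counts (asynchronous rates), but every
    velocity query conditions on the other stream's current state."""
    rng = np.random.default_rng(seed)
    x_v = rng.standard_normal(shape_v)  # vision tokens start as noise
    x_a = rng.standard_normal(shape_a)  # action tokens start as noise
    t_v = t_a = 0.0
    dv, da = 1.0 / steps_v, 1.0 / steps_a
    while t_v < 1.0 - 1e-9 or t_a < 1.0 - 1e-9:
        vv, va = velocity_fn(x_v, t_v, x_a, t_a)
        # Advance whichever stream lags behind in time.
        if t_a >= 1.0 - 1e-9 or (t_v + 1e-9 < t_a and t_v < 1.0 - 1e-9):
            x_v = x_v + dv * vv
            t_v += dv
        else:
            x_a = x_a + da * va
            t_a += da
    return x_v, x_a
```

Raising `steps_a` relative to `steps_v` is one way to picture the test-time scaling knob: actions get more refinement passes per rollout without re-denoising the visual prediction at the same rate.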
πŸ”Ž Similar Papers
No similar papers found.
Authors
John Won (KAIST)
Kyungmin Lee (KAIST)
Huiwon Jang (KAIST)
Dongyoung Kim (KAIST, RLWRLD)
Jinwoo Shin (KAIST, ICT Endowed Chair Professor)
Topics: Machine Learning, Deep Learning