AI Summary
This paper addresses two challenges in robot policy learning: modality heterogeneity and the tight temporal coupling between visual state prediction and action generation. The authors propose DUST, a dual-stream diffusion framework whose core contributions are: (1) a decoupled visual–action dual-stream architecture with a cross-modal shared representation; (2) independent noise modeling and a decoupled flow-matching loss that let states and actions evolve asynchronously while still being sampled jointly; and (3) end-to-end policy learning via a multimodal diffusion Transformer integrated with a world model. On the RoboCasa and GR-1 benchmarks, DUST achieves up to a 6% absolute improvement over strong baselines; test-time scaling yields a further 2–5% gain, and real-robot task success rates increase by 13%. The pre-trained model also exhibits strong cross-task transferability.
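The paper releases no code here; the sketch below illustrates the idea of contribution (2) under common rectified-flow assumptions: each modality gets its own noise sample and timestep, and the flow-matching loss is the sum of two independent velocity-regression terms. All names (`decoupled_fm_loss`, `v_model`, the linear interpolation path) are hypothetical, not DUST's actual implementation.

```python
import random

def lerp(noise, data, t):
    """Linear interpolation path z_t = (1 - t) * noise + t * data."""
    return [(1 - t) * n + t * x for n, x in zip(noise, data)]

def decoupled_fm_loss(x_vis, x_act, v_model):
    """Decoupled flow-matching loss with independent per-modality noise.

    Vision and action tokens are perturbed with separate noise draws and
    separate timesteps, so the model can learn their joint distribution
    without forcing both modalities into one unified latent space.
    """
    t_v, t_a = random.random(), random.random()          # independent timesteps
    n_v = [random.gauss(0.0, 1.0) for _ in x_vis]        # vision-stream noise
    n_a = [random.gauss(0.0, 1.0) for _ in x_act]        # action-stream noise
    z_v = lerp(n_v, x_vis, t_v)                          # noisy vision tokens
    z_a = lerp(n_a, x_act, t_a)                          # noisy action tokens
    # Flow-matching targets: constant velocity from noise toward data.
    u_v = [x - n for x, n in zip(x_vis, n_v)]
    u_a = [x - n for x, n in zip(x_act, n_a)]
    # Dual-stream model sees both noisy streams and both timesteps.
    v_v, v_a = v_model(z_v, t_v, z_a, t_a)
    mse = lambda p, q: sum((pi - qi) ** 2 for pi, qi in zip(p, q)) / len(p)
    return mse(v_v, u_v) + mse(v_a, u_a)                 # two decoupled terms
```

Because the two loss terms share no noise or timestep, gradients for each stream depend on the other only through the model's cross-modal attention, which matches the decoupling the summary describes.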
Abstract
Recently, augmenting Vision-Language-Action models (VLAs) with world modeling has shown promise in improving robotic policy learning. However, jointly predicting next-state observations and action sequences remains challenging because of the inherent difference between the two modalities. To address this, we propose DUal-STream diffusion (DUST), a world-model-augmented VLA framework that resolves this modality conflict and enhances VLA performance across diverse tasks. Specifically, we propose a multimodal diffusion transformer architecture that explicitly maintains separate modality streams while still enabling cross-modal knowledge sharing. In addition, we introduce independent noise perturbations for each modality and a decoupled flow-matching loss. This design enables the model to learn the joint distribution in a bidirectional manner while avoiding the need for a unified latent space. Building on this decoupling of modalities during training, we also introduce a joint sampling method that supports test-time scaling, in which action and vision tokens evolve asynchronously at different rates. In experiments on simulated benchmarks such as RoboCasa and GR-1, DUST achieves up to 6% gains over baseline methods, and our test-time scaling approach provides an additional 2–5% boost. On real-world tasks with the Franka Research 3, DUST improves success rates by 13%, confirming its effectiveness beyond simulation. Furthermore, pre-training on action-free videos from BridgeV2 yields significant transfer gains on RoboCasa, underscoring DUST's potential for large-scale VLA pretraining.
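The asynchronous joint sampling mentioned in the abstract can be pictured as two Euler integrations of the learned velocity fields that share one model call per step but advance with different step counts. This is only an illustrative sketch of that scheme, not DUST's actual scheduler; `v_model` and the step counts are assumptions.

```python
def joint_sample(v_model, z_v, z_a, steps_v=4, steps_a=8):
    """Asynchronous joint Euler sampling over t in [0, 1].

    Vision tokens z_v and action tokens z_a start from noise and are
    integrated toward data with different step sizes, so the two
    modalities evolve at different rates while conditioning on each
    other through the shared dual-stream model call.
    """
    t_v = t_a = 0.0
    dt_v, dt_a = 1.0 / steps_v, 1.0 / steps_a
    for _ in range(max(steps_v, steps_a)):
        # One joint forward pass yields velocities for both streams.
        vel_v, vel_a = v_model(z_v, t_v, z_a, t_a)
        if t_v < 1.0:  # vision stream may finish earlier (coarser steps)
            z_v = [z + dt_v * v for z, v in zip(z_v, vel_v)]
            t_v = min(1.0, t_v + dt_v)
        if t_a < 1.0:  # action stream refines with more, smaller steps
            z_a = [z + dt_a * v for z, v in zip(z_a, vel_a)]
            t_a = min(1.0, t_a + dt_a)
    return z_v, z_a
```

Raising `steps_a` relative to `steps_v` is one way such a scheme could spend extra test-time compute on action refinement, in the spirit of the test-time scaling results reported above.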