Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model

πŸ“… 2025-10-31
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This paper addresses modality heterogeneity and the tight temporal coupling between visual state prediction and action generation in robot policy learning. The authors propose DUST, a dual-stream diffusion framework. Its core contributions are: (1) a decoupled vision–action dual-stream architecture with cross-modal representation sharing; (2) independent noise perturbations and a decoupled flow-matching loss that enable asynchronous evolution and joint sampling of states and actions; and (3) end-to-end policy learning via a multimodal diffusion Transformer integrated with a world model. On the RoboCasa and GR-1 benchmarks, DUST achieves up to 6% gains over strong baselines; test-time scaling yields an additional 2–5%, and success rates on real-robot tasks with the Franka Research 3 improve by 13%. Moreover, the pre-trained model demonstrates strong cross-task transferability.

πŸ“ Abstract
Recently, augmenting Vision-Language-Action models (VLAs) with world modeling has shown promise in improving robotic policy learning. However, it remains challenging to jointly predict next-state observations and action sequences because of the inherent difference between the two modalities. To address this, we propose DUal-STream diffusion (DUST), a world-model augmented VLA framework that handles the modality conflict and enhances the performance of VLAs across diverse tasks. Specifically, we propose a multimodal diffusion transformer architecture that explicitly maintains separate modality streams while still enabling cross-modal knowledge sharing. In addition, we introduce independent noise perturbations for each modality and a decoupled flow-matching loss. This design enables the model to learn the joint distribution in a bidirectional manner while avoiding the need for a unified latent space. Based on the decoupling of modalities during training, we also introduce a joint sampling method that supports test-time scaling, where action and vision tokens evolve asynchronously at different rates. Through experiments on simulated benchmarks such as RoboCasa and GR-1, DUST achieves up to 6% gains over baseline methods, while our test-time scaling approach provides an additional 2-5% boost. On real-world tasks with the Franka Research 3, DUST improves success rates by 13%, confirming its effectiveness beyond simulation. Furthermore, pre-training on action-free videos from BridgeV2 yields significant transfer gains on RoboCasa, underscoring DUST's potential for large-scale VLA pretraining.
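The abstract's central training idea, independent noise perturbations per modality with a decoupled flow-matching loss, can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, shapes, and the dummy `pred_fn` interface are hypothetical, and a simple linear interpolation path with a velocity-regression target is assumed, as in standard flow matching.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_targets(x0, t, noise):
    """Linear path x_t = (1 - t) * noise + t * x0;
    the regression target is the constant velocity x0 - noise."""
    x_t = (1.0 - t) * noise + t * x0
    v_target = x0 - noise
    return x_t, v_target

def decoupled_fm_loss(pred_fn, vision, action):
    """Decoupled flow-matching loss with independent timesteps and
    noise for each modality stream (hypothetical interface)."""
    t_v = rng.uniform(size=(vision.shape[0], 1))   # vision timestep
    t_a = rng.uniform(size=(action.shape[0], 1))   # action timestep
    z_v = rng.standard_normal(vision.shape)        # vision noise
    z_a = rng.standard_normal(action.shape)        # action noise
    xv_t, vv = flow_matching_targets(vision, t_v, z_v)
    xa_t, va = flow_matching_targets(action, t_a, z_a)
    # Dual-stream model sees both noisy streams and both timesteps.
    pv, pa = pred_fn(xv_t, t_v, xa_t, t_a)
    # Each stream gets its own flow-matching term; no unified latent.
    loss_v = np.mean((pv - vv) ** 2)
    loss_a = np.mean((pa - va) ** 2)
    return loss_v + loss_a
```

Because each stream carries its own timestep, the model can be queried with the vision stream at one noise level and the action stream at another, which is what makes the asynchronous joint sampling described next possible.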
Problem

Research questions and friction points this paper is trying to address.

Jointly predicting next-state observations and robotic action sequences
Handling modality conflict between vision and action in VLAs
Enhancing robotic policy learning through world-model augmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-stream diffusion handles modality conflict
Independent noise and loss decouple modalities
Joint sampling with asynchronous per-modality rates enables test-time scaling
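The joint-sampling idea above, where vision and action tokens evolve at different rates while conditioning on each other, can be sketched as a simple Euler sampler over two time grids. This is a hedged illustration under assumed conventions (linear flow from t=0 noise to t=1 data); the `velocity_fn` interface and step counts are hypothetical, not the paper's sampler.

```python
import numpy as np

def joint_sample(velocity_fn, shape_v, shape_a, steps_v=4, steps_a=8, seed=0):
    """Euler integration of both streams from t=0 (noise) to t=1 (data).
    The streams use different step counts (asynchronous rates), but every
    velocity query conditions on the other stream's current state."""
    rng = np.random.default_rng(seed)
    x_v = rng.standard_normal(shape_v)  # vision tokens start as noise
    x_a = rng.standard_normal(shape_a)  # action tokens start as noise
    t_v = t_a = 0.0
    dv, da = 1.0 / steps_v, 1.0 / steps_a
    while t_v < 1.0 - 1e-9 or t_a < 1.0 - 1e-9:
        vv, va = velocity_fn(x_v, t_v, x_a, t_a)
        # Advance whichever stream lags behind in time.
        if t_a >= 1.0 - 1e-9 or (t_v + 1e-9 < t_a and t_v < 1.0 - 1e-9):
            x_v = x_v + dv * vv
            t_v += dv
        else:
            x_a = x_a + da * va
            t_a += da
    return x_v, x_a
```

Raising `steps_a` relative to `steps_v` is one way to picture the test-time scaling knob: actions get more refinement passes per rollout without re-denoising the visual prediction at the same rate.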
πŸ”Ž Similar Papers
No similar papers found.
Authors
John Won (KAIST)
Kyungmin Lee (KAIST)
Huiwon Jang (KAIST)
Dongyoung Kim (KAIST, RLWRLD)
Jinwoo Shin (KAIST, ICT Endowed Chair Professor)
Topics: Machine Learning, Deep Learning