DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge

📅 2025-07-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current vision-language-action (VLA) models rely heavily on image prediction, which carries redundant information and under-models critical world knowledge such as physical dynamics, spatial relationships, and semantic structure, limiting generalization and reasoning. DreamVLA addresses this with: (1) dynamic-region-guided world knowledge prediction that explicitly models dynamic, spatial, and semantic cues; (2) block-wise structured attention that masks mutual attention among these cues to prevent interference; and (3) a diffusion-based transformer that models the conditional distribution over future actions for inverse-dynamics modeling. Together these form a closed perception-prediction-action loop. DreamVLA reaches a 76.7% success rate on real-robot tasks and an average length of 4.44 on the CALVIN ABC-D benchmark.

📝 Abstract

Recent advances in vision-language-action (VLA) models have shown promise in integrating image generation with action prediction to improve generalization and reasoning in robot manipulation. However, existing methods are limited to challenging image-based forecasting, which suffers from redundant information and lacks comprehensive and critical world knowledge, including dynamic, spatial, and semantic information. To address these limitations, we propose DreamVLA, a novel VLA framework that integrates comprehensive world knowledge forecasting to enable inverse dynamics modeling, thereby establishing a perception-prediction-action loop for manipulation tasks. Specifically, DreamVLA introduces dynamic-region-guided world knowledge prediction, integrated with spatial and semantic cues, which provides compact yet comprehensive representations for action planning. This design aligns with how humans interact with the world by first forming abstract multimodal reasoning chains before acting. To mitigate interference among the dynamic, spatial, and semantic information during training, we adopt a block-wise structured attention mechanism that masks their mutual attention, preventing information leakage and keeping each representation clean and disentangled. Moreover, to model the conditional distribution over future actions, we employ a diffusion-based transformer that disentangles action representations from shared latent features. Extensive experiments in both real-world and simulation environments demonstrate that DreamVLA achieves a 76.7% success rate on real robot tasks and an average length of 4.44 on the CALVIN ABC-D benchmark.
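
The block-wise structured attention described in the abstract can be illustrated with a small boolean mask. The sketch below is not DreamVLA's implementation: the group names, token counts, and the exact visibility rules (each knowledge group sees the shared context and itself, while action tokens see everything) are assumptions chosen only to show how mutual attention among the dynamic, spatial, and semantic groups can be masked out.

```python
# Illustrative block-wise attention mask (assumed layout, not the paper's code).
# mask[q][k] is True when query token q is allowed to attend to key token k.

def build_block_mask(group_sizes, allowed):
    """Expand group-level visibility rules into a token-level boolean mask."""
    # One label per token, in sequence order.
    labels = [name for name, count in group_sizes for _ in range(count)]
    return [[key in allowed[query] for key in labels] for query in labels]

# Hypothetical token groups and sizes.
GROUPS = [("context", 4), ("dynamic", 2), ("spatial", 2), ("semantic", 2), ("action", 1)]

# Assumed rules: the three knowledge groups never attend to one another.
ALLOWED = {
    "context": {"context"},
    "dynamic": {"context", "dynamic"},
    "spatial": {"context", "spatial"},
    "semantic": {"context", "semantic"},
    "action": {"context", "dynamic", "spatial", "semantic", "action"},
}

mask = build_block_mask(GROUPS, ALLOWED)

if __name__ == "__main__":
    for row in mask:
        print("".join("1" if visible else "." for visible in row))
```

In an attention layer, positions where the mask is False would typically have their scores set to negative infinity before the softmax, so the masked groups contribute nothing to one another's representations.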

Problem

Research questions and friction points this paper is trying to address.

Existing VLA models lack the comprehensive world knowledge needed for robust robot manipulation

Image-based forecasting carries redundant information

Dynamic, spatial, and semantic information is under-represented in action planning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic-region-guided world knowledge prediction

Block-wise structured attention mechanism

Diffusion-based transformer for action modeling
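
As a rough illustration of how a diffusion model turns noise into an action vector, the sketch below runs a deterministic DDIM-style reverse process. It is an assumption-laden toy: the source does not state which sampler or noise schedule DreamVLA uses, the linear schedule and function names are hypothetical, and the `predict_noise` callable stands in for the learned transformer denoiser (here replaced by an oracle that knows the target action, purely to show that the update rule is consistent).

```python
import math

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bars, prod = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars

def ddim_sample(predict_noise, x_T, alpha_bars):
    """Deterministic reverse process: start from x_T and denoise down to step 0."""
    x = list(x_T)
    for t in reversed(range(len(alpha_bars))):
        ab_t = alpha_bars[t]
        ab_prev = alpha_bars[t - 1] if t > 0 else 1.0
        eps = predict_noise(x, t)
        # Estimate the clean action from the current sample and predicted noise.
        x0_hat = [(xi - math.sqrt(1 - ab_t) * ei) / math.sqrt(ab_t)
                  for xi, ei in zip(x, eps)]
        # Re-noise the estimate to the previous (less noisy) step.
        x = [math.sqrt(ab_prev) * x0i + math.sqrt(1 - ab_prev) * ei
             for x0i, ei in zip(x0_hat, eps)]
    return x

def make_oracle(target, alpha_bars):
    """Stand-in for a trained denoiser: returns the exact noise given the target."""
    def predict_noise(x, t):
        ab = alpha_bars[t]
        return [(xi - math.sqrt(ab) * ai) / math.sqrt(1 - ab)
                for xi, ai in zip(x, target)]
    return predict_noise

alpha_bars = make_alpha_bars(50)
target_action = [0.3, -0.1, 0.8]   # hypothetical 3-D action
start_noise = [1.2, -0.7, 0.05]    # fixed "noise" so the example is reproducible
sampled = ddim_sample(make_oracle(target_action, alpha_bars), start_noise, alpha_bars)
```

In a real policy the oracle would be replaced by a network conditioned on the observation, language instruction, and predicted world knowledge, and the starting sample would be drawn from a Gaussian.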
