Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone

📅 2025-12-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Autoregressive vision-language models (VLMs) generate tokens sequentially, which limits their efficiency in complex visual planning and real-time robotic control. To address this, the paper proposes Dream-VL and Dream-VLA, open vision-language and vision-language-action models built on diffusion-based large language models (dLLMs). The natively bidirectional dLLM backbone is inherently suited to action chunking and parallel action-token generation, and Dream-VLA is obtained from Dream-VL through continuous pre-training on open robotic datasets. Experiments show that Dream-VL matches top-tier open-source autoregressive VLMs on visual-understanding benchmarks while showing superior potential for visual planning, and Dream-VLA reaches a 97.2% average success rate on LIBERO, 71.4% on SimplerEnv-Bridge, and 60.5% on SimplerEnv-Fractal, surpassing leading models such as π₀ and GR00T-N1. These results support the diffusion paradigm as an effective, fast-converging foundation for vision-language-action modeling.

📝 Abstract
While autoregressive Large Vision-Language Models (VLMs) have achieved remarkable success, their sequential generation often limits their efficacy in complex visual planning and dynamic robotic control. In this work, we investigate the potential of constructing Vision-Language Models upon diffusion-based large language models (dLLMs) to overcome these limitations. We introduce Dream-VL, an open diffusion-based VLM (dVLM) that achieves state-of-the-art performance among previous dVLMs. Dream-VL is comparable to top-tier AR-based VLMs trained on open data on various benchmarks but exhibits superior potential when applied to visual planning tasks. Building upon Dream-VL, we introduce Dream-VLA, a dLLM-based Vision-Language-Action model (dVLA) developed through continuous pre-training on open robotic datasets. We demonstrate that the natively bidirectional nature of this diffusion backbone serves as a superior foundation for VLA tasks, inherently suited for action chunking and parallel generation, leading to significantly faster convergence in downstream fine-tuning. Dream-VLA achieves top-tier performance of 97.2% average success rate on LIBERO, 71.4% overall average on SimplerEnv-Bridge, and 60.5% overall average on SimplerEnv-Fractal, surpassing leading models such as $\pi_0$ and GR00T-N1. We also validate that dVLMs surpass AR baselines on downstream tasks across different training objectives. We release both Dream-VL and Dream-VLA to facilitate further research in the community.
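
The abstract states that the bidirectional diffusion backbone is inherently suited to action chunking and parallel generation but gives no implementation details. The following PyTorch sketch is a minimal, hypothetical illustration of how a masked-diffusion model could decode an action chunk in a fixed number of denoising passes instead of token by token; the toy model, mask token id, chunk length, and step count are all assumptions, not the paper's actual interface.

```python
import torch

# Hypothetical constants; the real vocabulary, mask token, chunk size, and
# number of denoising steps are not specified in the abstract above.
MASK_ID = 0          # id of the [MASK] placeholder token (assumed)
CHUNK_LEN = 8        # action tokens decoded as one chunk (assumed)
NUM_STEPS = 4        # denoising passes; each commits CHUNK_LEN // NUM_STEPS tokens
VOCAB_SIZE = 256     # size of a discretized action/text vocabulary (assumed)


class ToyBidirectionalDenoiser(torch.nn.Module):
    """Stand-in for the dVLM/dVLA backbone: full (non-causal) attention over
    the observation/instruction prefix and the masked action chunk."""

    def __init__(self, vocab_size: int, dim: int = 64):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, dim)
        layer = torch.nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = torch.nn.TransformerEncoder(layer, num_layers=2)
        self.head = torch.nn.Linear(dim, vocab_size)

    def forward(self, tokens):                  # (B, L) -> (B, L, V)
        return self.head(self.encoder(self.embed(tokens)))


@torch.no_grad()
def decode_action_chunk(model, prefix_ids):
    """Iterative unmasking: every pass predicts all masked action positions at
    once and commits the most confident ones, instead of one token per pass."""
    chunk = torch.full((1, CHUNK_LEN), MASK_ID, dtype=torch.long)
    tokens = torch.cat([prefix_ids, chunk], dim=1)
    start = prefix_ids.shape[1]
    filled = torch.zeros(1, CHUNK_LEN, dtype=torch.bool)
    per_step = CHUNK_LEN // NUM_STEPS
    for _ in range(NUM_STEPS):
        logits = model(tokens)[:, start:]                # (1, CHUNK_LEN, V)
        conf, pred = logits.softmax(-1).max(-1)          # per-position confidence / argmax
        conf = conf.masked_fill(filled, -1.0)            # never re-commit filled slots
        top = conf.topk(per_step, dim=-1).indices        # positions to commit this pass
        tokens[0, start + top[0]] = pred[0, top[0]]
        filled[0, top[0]] = True
    return tokens[:, start:]                             # whole chunk after NUM_STEPS passes


prefix = torch.randint(1, VOCAB_SIZE, (1, 16))           # stands in for image + instruction tokens
actions = decode_action_chunk(ToyBidirectionalDenoiser(VOCAB_SIZE), prefix)
print(actions.shape)                                     # torch.Size([1, 8])
```

The point of the sketch is that the chunk is produced in NUM_STEPS forward passes regardless of its length, whereas an autoregressive decoder would need one sequential pass per action token.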
Problem

Research questions and friction points this paper is trying to address.

Sequential generation in autoregressive VLMs limits their efficacy in complex visual planning and dynamic robotic control
Can diffusion-based LLMs serve as a VLM/VLA backbone that natively supports action chunking and parallel action generation?
How do diffusion-based VLMs compare with autoregressive baselines in downstream convergence speed and task performance?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion language model (dLLM) backbone for vision-language and vision-language-action models
Continuous pre-training on open robotic datasets to turn the dVLM into a VLA model
Natively bidirectional modeling that enables action chunking and parallel action generation (objective contrast sketched below)
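
As a loose illustration of why the bidirectional backbone differs from the autoregressive baselines it is compared against, the sketch below contrasts the two standard training objectives involved: next-token prediction, which forces sequential decoding, versus masked-token prediction from bidirectional context, the discrete-diffusion-style objective a dLLM uses. The exact masking schedule and loss weighting of Dream-VL/Dream-VLA are not given in this summary, so the tensors and masking ratio here are assumptions.

```python
import torch
import torch.nn.functional as F

def ar_next_token_loss(logits, targets):
    """Autoregressive objective: position t is supervised by token t+1,
    so inference must emit action tokens one at a time."""
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        targets[:, 1:].reshape(-1),
    )

def masked_token_loss(logits, targets, mask):
    """Discrete-diffusion-style objective: a random subset of positions is
    masked and all of them are predicted at once from bidirectional context,
    which is what later allows a whole action chunk to be decoded in parallel."""
    return F.cross_entropy(logits[mask], targets[mask])

# Purely illustrative tensors (shapes and masking ratio are assumptions).
B, L, V = 2, 12, 32
logits = torch.randn(B, L, V)
targets = torch.randint(0, V, (B, L))
mask = torch.rand(B, L) < 0.5
print(ar_next_token_loss(logits, targets), masked_token_loss(logits, targets, mask))
```

Because the masked objective supervises many positions per forward pass, a whole action chunk can later be denoised jointly rather than emitted token by token.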