DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models

📅 2025-12-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Diffusion vision-language models (dVLMs) still lag well behind autoregressive (AR) multimodal models in both capability and efficiency. To close this gap, the paper proposes DiffusionVL, a general-purpose framework that translates any strong AR multimodal model into a dVLM through lightweight fine-tuning, yielding a model capable of arbitrary-length generation. A block-wise decoding design jointly addresses long-sequence modeling and inference acceleration, reusing KV caches across blocks to further cut compute. Trained on less than 5% of the data required by prior dVLMs, DiffusionVL delivers a 34.4% gain on MMMU-Pro (vision) and a 37.5% gain on MME (Cognition), alongside a 2x inference speedup. Code and models are publicly released.
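
The block-wise decoder described above can be sketched compactly. The snippet below is only an illustration under assumed interfaces, not DiffusionVL's released API: `model(ids, past_kv=..., use_cache=...)` is taken to behave like a standard transformer returning `.logits` and an updated KV cache, and `MASK_ID`, the block size, and the confidence-based unmasking schedule are all hypothetical choices. Each block is denoised in parallel from a fully masked state; once finalized, its keys/values join the cache so later blocks attend to it without recomputation.

```python
import torch

MASK_ID = 151666   # hypothetical [MASK] token id (not from the released model)
BLOCK = 32         # tokens decoded per block (assumed)
STEPS = 8          # parallel denoising steps per block (assumed)

@torch.no_grad()
def blockwise_diffusion_decode(model, prompt_ids, max_blocks=16, eos_id=2):
    """Sketch of block-wise parallel decoding with KV-cache reuse.

    Assumes an HF-style interface: model(ids, past_kv=..., use_cache=...)
    returns an object with .logits and .past_kv. Illustration only.
    """
    # Encode the prompt once; its keys/values are cached and shared by all blocks.
    past_kv = model(prompt_ids, past_kv=None, use_cache=True).past_kv
    generated = []

    for _ in range(max_blocks):
        # Start from a fully masked block, then denoise it in parallel.
        block = torch.full((1, BLOCK), MASK_ID, dtype=torch.long,
                           device=prompt_ids.device)
        for step in range(STEPS):
            logits = model(block, past_kv=past_kv, use_cache=False).logits
            conf, pred = logits.softmax(-1).max(-1)
            still_masked = block == MASK_ID
            # Commit the most confident predictions among still-masked slots;
            # this linear unmasking schedule is an assumption.
            k = max(1, int(still_masked.sum().item()) // (STEPS - step))
            conf = conf.masked_fill(~still_masked, -1.0)
            idx = conf.topk(k, dim=-1).indices
            block.scatter_(1, idx, pred.gather(1, idx))
        # The block is final: run it once more with caching so later blocks
        # reuse its keys/values instead of recomputing them.
        past_kv = model(block, past_kv=past_kv, use_cache=True).past_kv
        generated.append(block)
        if (block == eos_id).any():
            break
    return torch.cat(generated, dim=-1)
```

This is where the speedup comes from: within a block, many tokens are committed per forward pass instead of one, while the growing KV cache keeps per-block cost comparable to AR decoding.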

📝 Abstract
In recent multimodal research, the diffusion paradigm has emerged as a promising alternative to the autoregressive (AR) paradigm, owing to its unique decoding advantages. However, due to the capability limitations of the base diffusion language model, the performance of diffusion vision language models (dVLMs) still lags significantly behind that of mainstream models. This leads to a simple yet fundamental question: is it possible to construct dVLMs based on existing powerful AR models? In response, we propose DiffusionVL, a dVLM family that can be translated from any powerful AR model. Through simple fine-tuning, we successfully adapt AR pre-trained models to the diffusion paradigm. This approach yields two key observations: (1) the paradigm shift from AR-based multimodal models to diffusion is remarkably effective; (2) direct conversion of an AR language model to a dVLM is also feasible, achieving performance competitive with LLaVA-style visual instruction tuning. Further, we introduce a block-decoding design into dVLMs that supports arbitrary-length generation and KV cache reuse, achieving a significant inference speedup. We conduct extensive experiments: despite training with less than 5% of the data required by prior methods, DiffusionVL achieves a comprehensive performance improvement, with a 34.4% gain on the MMMU-Pro (vision) benchmark and a 37.5% gain on the MME (Cog.) benchmark, alongside a 2x inference speedup. The model and code are released at https://github.com/hustvl/DiffusionVL.
Problem

Research questions and friction points this paper is trying to address.

Converting autoregressive vision-language models to the diffusion paradigm
Improving diffusion multimodal model performance by building on existing AR models
Speeding up inference with a block-decoding design that supports arbitrary-length generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Translates pre-trained autoregressive models into diffusion vision-language models through lightweight fine-tuning (see the training sketch below)
Uses block-decoding for arbitrary-length generation and KV cache reuse
Achieves comprehensive performance gains with less than 5% of the training data used by prior methods
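
To make the translation step concrete, here is a minimal sketch of the kind of masked-denoising fine-tuning objective such an AR-to-diffusion conversion implies. This is not the paper's released training code: the per-sequence masking-ratio sampling, the `MASK_ID` value, and the loss formulation are assumptions, and the attention-mask change (causal to bidirectional within response blocks) is assumed to be handled inside the model.

```python
import torch
import torch.nn.functional as F

MASK_ID = 151666   # hypothetical [MASK] token id
IGNORE = -100      # standard cross-entropy ignore index

def dvl_masked_denoising_loss(model, input_ids, response_mask):
    """One fine-tuning step translating an AR model to a diffusion objective.

    input_ids:     (B, T) prompt + response tokens
    response_mask: (B, T) bool, True on response positions (only these are noised)
    Assumes model(ids) returns an object with .logits of shape (B, T, V).
    """
    # Sample a masking ratio per sequence, as in discrete diffusion training
    # (the uniform schedule here is an assumption).
    ratio = torch.rand(input_ids.size(0), 1, device=input_ids.device)
    noise = (torch.rand_like(input_ids, dtype=torch.float) < ratio) & response_mask

    corrupted = input_ids.masked_fill(noise, MASK_ID)
    labels = input_ids.masked_fill(~noise, IGNORE)  # loss only on masked positions

    logits = model(corrupted).logits
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=IGNORE
    )
```

The key departure from AR fine-tuning is that the loss is computed on randomly masked response positions recovered in parallel, rather than on next-token prediction, which is what lets the converted model decode whole blocks at once.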