BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation

📅 2026-04-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the slow token-by-token inference of autoregressive vision-language models (VLMs) and the substantial quality degradation that often results from directly converting them into large-block diffusion VLMs. The authors propose BARD, a framework that converts a pretrained autoregressive VLM into a diffusion VLM with the same architecture. Its components are progressive supervised block merging, stage-wise distillation within the diffusion domain, a mixed noise scheduler, and memory-friendly training. Experiments transfer Qwen3-VL using ≤4.4M training samples, and the resulting 4B and 8B models are reported as state of the art among comparable-scale open diffusion VLMs on the authors' evaluation suite, with up to a 3× decoding throughput speedup over the source model. The authors also find that distillation within the diffusion regime is consistently effective, whereas direct autoregressive-to-diffusion distillation is poorly aligned and can hurt performance.

📝 Abstract

Autoregressive vision-language models (VLMs) deliver strong multimodal capability, but their token-by-token decoding imposes a fundamental inference bottleneck. Diffusion VLMs offer a more parallel decoding paradigm, yet directly converting a pretrained autoregressive VLM into a large-block diffusion VLM (dVLM) often leads to substantial quality degradation. In this work, we present BARD, a simple and effective bridging framework that converts a pretrained autoregressive VLM into a same-architecture, decoding-efficient dVLM. Our approach combines progressive supervised block merging, which gradually enlarges the decoding block size, with stage-wise intra-dVLM distillation from a fixed small-block diffusion anchor to recover performance lost at larger blocks. We further incorporate a mixed noise scheduler to improve robustness and token revision during denoising, and memory-friendly training to enable efficient training on long multimodal sequences. A key empirical finding is that direct autoregressive-to-diffusion distillation is poorly aligned and can even hurt performance, whereas distillation within the diffusion regime is consistently effective. Experimental results show that, with ≤4.4M data, BARD-VL transfers strong multimodal capability from Qwen3-VL to a large-block dVLM. Remarkably, BARD-VL establishes a new SOTA among comparable-scale open dVLMs on our evaluation suite at both 4B and 8B scales. At the same time, BARD-VL achieves up to 3× decoding throughput speedup compared to the source model.
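
To make the block-size argument concrete, the following is a minimal sketch that counts sequential forward passes for token-by-token decoding versus block-wise diffusion decoding, and builds a gradually growing block-size schedule. Everything in it is an illustrative assumption rather than the paper's procedure: the function names, the doubling rule, and the example numbers (256 tokens, block size 32, 8 denoising steps per block) are hypothetical. Pass counts are only a rough proxy for throughput, since a block-wise pass can cost more than a single-token pass.

```python
import math


def autoregressive_passes(num_tokens: int) -> int:
    """Token-by-token decoding: one sequential forward pass per token."""
    return num_tokens


def block_diffusion_passes(num_tokens: int, block_size: int, steps_per_block: int) -> int:
    """Block-wise decoding: blocks are produced left to right, and each
    block is refined with a fixed number of denoising passes (assumed)."""
    num_blocks = math.ceil(num_tokens / block_size)
    return num_blocks * steps_per_block


def doubling_schedule(start: int, target: int) -> list[int]:
    """Hypothetical progressive schedule: double the block size at each
    training stage until the target size is reached."""
    sizes = [start]
    while sizes[-1] < target:
        sizes.append(min(sizes[-1] * 2, target))
    return sizes


if __name__ == "__main__":
    print(doubling_schedule(4, 32))
    print(autoregressive_passes(256))
    print(block_diffusion_passes(256, block_size=32, steps_per_block=8))
```

Under these assumed numbers, the block-wise decoder needs fewer sequential passes only when the denoising steps per block stay below the block size, which is one way to see why larger blocks matter for decoding efficiency.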

Problem

Research questions and friction points this paper is trying to address.

autoregressive vision-language models

diffusion vision-language models

model conversion

performance degradation

decoding efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

progressive block merging

stage-wise distillation

diffusion vision-language models