SDAR-VL: Stable and Efficient Block-wise Diffusion for Vision-Language Understanding

📅 2025-12-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Block-level discrete diffusion models offer parallel generation and causal modeling advantages for vision-language understanding (VLU), yet have long lagged behind autoregressive baselines due to high training cost, slow convergence, and instability. This work presents the first systematic application of such models to large-scale VLU, introducing three synergistic innovations: asynchronous block-wise noise scheduling, effective mask ratio scaling, and progressive Beta noise curriculum—collectively enhancing training stability, efficiency, and generalization. We further integrate dynamic mask normalization, multimodal alignment, and contrastive learning. Evaluated across 21 benchmarks—including single-image, multi-image, and video tasks—the model consistently outperforms prior diffusion-based approaches and matches or exceeds strong autoregressive models like LLaVA-OneVision. Our results establish block-level discrete diffusion as a practical, competitive backbone architecture for VLU.

📝 Abstract
Block-wise discrete diffusion offers an attractive balance between parallel generation and causal dependency modeling, making it a promising backbone for vision-language modeling. However, its practical adoption has been limited by high training cost, slow convergence, and instability, which have so far kept it behind strong autoregressive (AR) baselines. We present SDAR-VL, the first systematic application of block-wise discrete diffusion to large-scale vision-language understanding (VLU), together with an integrated framework for efficient and stable training. This framework unifies three components: (1) Asynchronous Block-wise Noise Scheduling to diversify supervision within each batch; (2) Effective Mask Ratio Scaling for unbiased loss normalization under stochastic masking; and (3) a Progressive Beta Noise Curriculum that increases effective mask coverage while preserving corruption diversity. Experiments on 21 single-image, multi-image, and video benchmarks show that SDAR-VL consistently improves training efficiency, convergence stability, and task performance over conventional block diffusion. On this evaluation suite, SDAR-VL sets a new state of the art among diffusion-based vision-language models and, under matched settings, matches or surpasses strong AR baselines such as LLaVA-OneVision as well as the global diffusion baseline LLaDA-V, establishing block-wise diffusion as a practical backbone for VLU.
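The abstract describes a Progressive Beta Noise Curriculum that raises effective mask coverage over training while keeping corruption stochastic. The paper's exact schedule is not given here; the sketch below is a hypothetical interpretation in which the Beta(α, β) parameters are linearly interpolated so the expected mask ratio α/(α+β) grows over training (here from 0.25 to 0.75), while per-sample ratios remain random.

```python
import torch


def beta_curriculum_params(step, total_steps, start=(1.0, 3.0), end=(3.0, 1.0)):
    """Hypothetical progressive Beta curriculum: linearly interpolate the
    Beta(alpha, beta) parameters over training. With the defaults, the
    expected mask ratio alpha / (alpha + beta) rises from 0.25 to 0.75."""
    t = min(max(step / total_steps, 0.0), 1.0)  # training progress in [0, 1]
    alpha = (1 - t) * start[0] + t * end[0]
    beta = (1 - t) * start[1] + t * end[1]
    return alpha, beta


def sample_mask_ratio(step, total_steps, shape=()):
    """Draw stochastic mask ratios from the current point of the curriculum."""
    alpha, beta = beta_curriculum_params(step, total_steps)
    return torch.distributions.Beta(alpha, beta).sample(shape)
```

Because ratios are sampled rather than fixed, corruption diversity is preserved at every stage even as the mean mask coverage increases.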
Problem

Research questions and friction points this paper is trying to address.

Addresses high training cost and slow convergence in block-wise diffusion models
Solves training instability issues in vision-language understanding diffusion models
Improves efficiency and performance of diffusion models for multimodal tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Asynchronous block-wise noise scheduling for diverse batch supervision
Effective mask ratio scaling for unbiased loss normalization
Progressive beta noise curriculum to increase mask coverage
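The first two innovations above can be sketched together: each block in a sequence gets its own independently sampled mask ratio (asynchronous scheduling), and the loss is normalized by the realised, rather than target, mask ratio (effective mask ratio scaling). This is a minimal illustrative sketch assuming a hypothetical function name and a Beta noise distribution; it is not the authors' implementation.

```python
import torch


def asynchronous_block_noise(tokens, block_size, mask_id, alpha=1.0, beta=1.0):
    """Sample one mask ratio per (sequence, block) from Beta(alpha, beta),
    mask tokens accordingly, and return the effective (realised) mask ratio
    per block for unbiased loss normalization."""
    batch, seq_len = tokens.shape
    n_blocks = seq_len // block_size
    # Asynchronous schedule: an independent ratio for every block in the batch.
    ratios = torch.distributions.Beta(alpha, beta).sample((batch, n_blocks))
    mask = torch.rand(batch, n_blocks, block_size) < ratios.unsqueeze(-1)
    mask = mask.reshape(batch, seq_len)
    noisy = torch.where(mask, torch.full_like(tokens, mask_id), tokens)
    # Effective mask ratio: the fraction actually masked, which can differ
    # from the sampled target ratio at finite block sizes.
    eff_ratio = mask.reshape(batch, n_blocks, block_size).float().mean(-1)
    return noisy, mask, eff_ratio
```

A per-block loss could then be weighted by `1.0 / eff_ratio.clamp(min=1.0 / block_size)` so that lightly masked blocks do not contribute a systematically downscaled gradient, which is one way to read "unbiased loss normalization under stochastic masking."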
👥 Authors
Shuang Cheng (Zhejiang University)
Yuhua Jiang (Tsinghua University)
Zineng Zhou (ByteDance)
Dawei Liu (Shanghai AI Laboratory)
Wang Tao (ByteDance)
Linfeng Zhang (DP Technology; AI for Science Institute)
Biqing Qi (Shanghai AI Laboratory)
Bowen Zhou (Shanghai AI Laboratory)