Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models

📅 2026-05-14

🤖 AI Summary
This work addresses the tendency of large diffusion-based vision-language models to generate repetitive text and exhibit degraded visual grounding when producing long descriptions. The study identifies the root causes as representation drift induced by mask token priors and weakened visual attention stemming from a mismatch between positional attention bias and the iterative demasking process. To mitigate these issues, the authors propose two training-free, plug-and-play strategies: Mask Prior Suppression and Monotonic RoPE Scaling. These methods consistently enhance performance across diverse diffusion architectures, yielding significant improvements on multimodal understanding and visual grounding benchmarks, with particularly robust gains in long-form caption generation tasks.
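The positional attention bias mentioned above can be illustrated with standard rotary position embeddings (RoPE), where the attention logit between a query and a key depends only on their relative distance. The sketch below is not the paper's Monotonic RoPE Scaling, whose exact formulation is not given here; the function names and the `scale` parameter are hypothetical and only show how rescaling position indices changes the effective distance seen by RoPE.

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Apply standard RoPE to an even-length vector at position `pos`."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        # Pair (i, i+1) rotates with frequency base^(-i/d).
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def rope_logit(q, k, q_pos, k_pos, scale=1.0):
    """Unnormalized attention logit after RoPE.

    `scale` is a hypothetical knob (an assumption for illustration):
    it multiplies both position indices, so the relative distance
    seen by RoPE becomes scale * (q_pos - k_pos).
    """
    rq = rope_rotate(q, q_pos * scale)
    rk = rope_rotate(k, k_pos * scale)
    return sum(a * b for a, b in zip(rq, rk))

if __name__ == "__main__":
    q = [1.0] * 8
    # Same content, growing distance between a generated token and a key.
    for dist in (0, 10, 100, 1000):
        print(dist, round(rope_logit(q, q, dist, 0), 3))
```

With identical content vectors the logit is largest at distance zero, and a scale of 0.5 makes a token 100 positions away look like one 50 positions away. How the authors choose and apply their scaling during iterative demasking is not specified in this summary.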
📝 Abstract
Large diffusion vision-language models (LDVLMs) have recently emerged as a promising alternative to autoregressive models, enabling parallel decoding for efficient inference and leveraging bidirectional attention for global context. Despite these advances, their behavior under long-form generation remains underexplored. In this work, we show that existing LDVLMs suffer from repetitive generation and degraded visual grounding, and identify two underlying causes. First, repetitive generation originates from a mask token prior: since generation tokens are initialized as mask tokens, their hidden representations progressively drift toward a shared prior direction over generation steps. Second, a fundamental misalignment between the positional attention bias and the iterative unmasking process suppresses attention toward informative visual tokens, degrading visual grounding. Based on these insights, we propose a training-free approach, introducing Mask Prior Suppression and Monotonic RoPE Scaling to mitigate mask prior drift and positional attention collapse during decoding. Experiments on general multimodal benchmarks and visual grounding tasks demonstrate improvements over baseline LDVLMs, with robust gains on long-form description benchmarks. Our results show that these failures can be effectively addressed with a lightweight, plug-and-play strategy that requires no additional training and generalizes across diverse LDVLM architectures.
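One way to picture the drift toward a shared prior direction is as a common component that grows in every masked position's hidden state. The sketch below removes a fraction of that component by orthogonal projection. It is a minimal illustration under stated assumptions, not the authors' Mask Prior Suppression: the `prior` vector (for example, an averaged mask-token representation), the `alpha` strength, and the function names are all hypothetical.

```python
import math

def suppress_prior(hidden, prior, alpha=1.0):
    """Subtract alpha times each vector's projection onto `prior`.

    hidden: list of hidden-state vectors (lists of floats).
    prior:  assumed shared prior direction (hypothetical input).
    alpha:  assumed suppression strength; 1.0 removes the component fully.
    """
    norm = math.sqrt(sum(p * p for p in prior))
    if norm == 0.0:
        # No direction to remove; return copies unchanged.
        return [list(h) for h in hidden]
    unit = [p / norm for p in prior]
    out = []
    for h in hidden:
        coeff = sum(x * u for x, u in zip(h, unit))
        out.append([x - alpha * coeff * u for x, u in zip(h, unit)])
    return out

def cosine(a, b):
    """Cosine similarity, used here to gauge alignment with the prior."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

if __name__ == "__main__":
    prior = [1.0, 0.0]
    hidden = [[1.0, 2.0], [3.0, 0.5]]
    cleaned = suppress_prior(hidden, prior, alpha=1.0)
    for before, after in zip(hidden, cleaned):
        print(round(cosine(before, prior), 3), round(cosine(after, prior), 3))
```

After full suppression the cosine similarity to the prior direction is zero for every vector, while the components orthogonal to it are left untouched. Where in the network and at which decoding steps such a correction would be applied is a design choice the abstract does not specify.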
Problem

Research questions and friction points this paper is trying to address.

mask prior drift
positional attention collapse
repetitive generation
visual grounding
large diffusion vision-language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mask Prior Suppression
Monotonic RoPE Scaling
Diffusion Vision-Language Models
Positional Attention Collapse
Training-Free Adaptation