🤖 AI Summary
Existing confidence-based parallel decoding methods for diffusion-based multimodal large language models ignore visual redundancy: high-confidence tokens committed in the same step can rely on overlapping visual evidence, which leaves less complementary visual grounding for later steps. This work introduces the Visual Redundancy Index (VRI), which quantifies the overlap in visual grounding among tokens committed in parallel, and proposes Visual-Redundancy-Controlled Decoding (VRCD), a training-free inference-time decoding strategy that uses token-to-image attention to prioritize visually complementary positions. Across multimodal benchmarks, VRCD reduces both visual redundancy and remaining-position entropy with modest runtime overhead; in longer decoding experiments, it achieves relative accuracy gains of up to 18.8% on M³CoT and 6.9% on MMBench over confidence-based decoding.
📝 Abstract
Diffusion-based multimodal large language models (dMLLMs) decode by iteratively predicting tokens at multiple masked positions in parallel. This turns each decoding step into a position-selection problem: the model must choose not only which predictions are reliable in isolation, but also which positions should be committed together as context for later decoding steps. Existing confidence-based decoding ranks masked positions independently and commits the top-K positions, largely ignoring whether the committed tokens provide complementary visual grounding. We identify a step-level limitation of this strategy in multimodal settings: high-confidence tokens selected in the same step can rely on overlapping visual grounding, introducing visual redundancy among the committed tokens and leaving less complementary visual grounding available for later decoding. To quantify this effect, we introduce the Visual Redundancy Index (VRI), which measures visual grounding overlap among tokens committed in parallel. To control this redundancy during decoding, we propose Visual-Redundancy-Controlled Decoding (VRCD), a training-free inference-time decoding method that uses token-to-image attention to prioritize visually complementary positions. Across diverse multimodal benchmarks, VRCD reduces visual redundancy and remaining-position entropy with modest runtime overhead. In longer decoding experiments, it also achieves relative accuracy gains of up to 18.8% on M³CoT and 6.9% on MMBench over confidence-based decoding. Code will be released at https://github.com/infiniteYuanyl/VRCD.
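The contrast between confidence-only top-K selection and redundancy-aware selection can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the exact VRI formula or the VRCD selection rule, so the overlap measure (sum of element-wise minima of two attention distributions), the greedy penalty, and the names `overlap`, `visual_redundancy`, `select_redundancy_aware`, and `lam` are all assumptions made for illustration.

```python
from itertools import combinations

def overlap(p, q):
    # Assumed overlap measure between two token-to-image attention
    # distributions: 1.0 if identical, 0.0 if they attend to disjoint patches.
    return sum(min(a, b) for a, b in zip(p, q))

def visual_redundancy(attn, positions):
    # Assumed stand-in for a redundancy index: mean pairwise overlap among
    # the positions committed in one step (0.0 with fewer than two positions).
    pairs = list(combinations(positions, 2))
    if not pairs:
        return 0.0
    return sum(overlap(attn[i], attn[j]) for i, j in pairs) / len(pairs)

def select_top_k(confidence, k):
    # Confidence-based baseline: rank positions independently, keep the top k.
    return sorted(range(len(confidence)), key=lambda i: -confidence[i])[:k]

def select_redundancy_aware(confidence, attn, k, lam=1.0):
    # Hypothetical greedy rule: repeatedly pick the position with the best
    # confidence minus lam times its largest overlap with positions already
    # chosen in this step.
    remaining = set(range(len(confidence)))
    chosen = []
    while remaining and len(chosen) < k:
        def score(i):
            penalty = max((overlap(attn[i], attn[j]) for j in chosen), default=0.0)
            return confidence[i] - lam * penalty
        best = max(sorted(remaining), key=score)
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Toy example: 4 masked positions, 4 image patches (made-up numbers).
confidence = [0.95, 0.93, 0.80, 0.60]
attn = [
    [0.7, 0.2, 0.1, 0.0],  # position 0 attends mostly to patch 0
    [0.6, 0.3, 0.1, 0.0],  # position 1 attends to nearly the same patches
    [0.0, 0.1, 0.2, 0.7],  # position 2 attends mostly to patch 3
    [0.1, 0.1, 0.7, 0.1],  # position 3 attends mostly to patch 2
]

baseline = select_top_k(confidence, k=2)
aware = select_redundancy_aware(confidence, attn, k=2)
print(baseline, visual_redundancy(attn, baseline))
print(aware, visual_redundancy(attn, aware))
```

In this toy setup the baseline commits positions 0 and 1, which attend to nearly the same patches, while the redundancy-aware rule commits positions 0 and 2 at a small cost in confidence. The weight `lam` controls that trade-off; with `lam=0` the greedy rule reduces to plain top-K selection.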