Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models

📅 2026-04-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Diffusion-based multimodal language models often suffer from premature answer generation and insufficient utilization of visual inputs when integrated with chain-of-thought reasoning, limiting their reasoning performance. To address these issues, this work proposes a Position and Step Penalty (PSP) mechanism to delay answer generation and introduces a Visual Reasoning Guidance (VRG) strategy, inspired by classifier-free guidance, to enhance the involvement of visual signals. This approach promotes adequate visual grounding and progressive reasoning during parallel generation. It is the first systematic solution to the problems of answer prematurity and weak visual dependency in diffusion-based multimodal models, achieving up to a 7.5% accuracy gain across multiple benchmarks and over threefold inference acceleration with less than one-quarter of the original diffusion steps.
📝 Abstract
Diffusion large language models (dLLMs) are emerging as promising alternatives to autoregressive (AR) LLMs. Recently, this paradigm has been extended to multimodal tasks, leading to the development of diffusion multimodal large language models (dMLLMs). These models are expected to retain the reasoning capabilities of LLMs while enabling faster inference through parallel generation. However, when combined with Chain-of-Thought (CoT) reasoning, dMLLMs exhibit two critical issues. First, we observe that dMLLMs often generate the final answer token at a very early timestep. This trend indicates that the model determines the answer before sufficient reasoning, leading to degraded reasoning performance. Second, during the initial timesteps, dMLLMs show minimal dependency on visual prompts, exhibiting a fundamentally different pattern of visual information utilization compared to AR vision-language models. In summary, these findings indicate that dMLLMs tend to generate premature final answers without sufficiently grounding on visual inputs. To address these limitations, we propose Position and Step Penalty (PSP) and Visual Reasoning Guidance (VRG). PSP penalizes tokens in later positions during early timesteps, delaying premature answer generation and encouraging progressive reasoning across timesteps. VRG, inspired by classifier-free guidance, amplifies visual grounding signals to enhance the model's alignment with visual evidence. Extensive experiments across various dMLLMs demonstrate that our method achieves up to 7.5% higher accuracy while delivering more than 3x speedup compared to reasoning with four times more diffusion steps.
Problem

Research questions and friction points this paper is trying to address.

diffusion multimodal language models
Chain-of-Thought reasoning
visual grounding
premature answer generation
reasoning performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion multimodal language models
Chain-of-Thought reasoning
visual grounding
Position and Step Penalty
Visual Reasoning Guidance
🔎 Similar Papers
No similar papers found.
K
Keuntae Kim
Department of Computer Science, Hanyang University
Mingyu Kang
Mingyu Kang
UC Berkeley
quantum physicsquantum computing
Y
Yong Suk Choi
Department of Computer Science, Hanyang University; Department of Artificial Intelligence, Hanyang University