Thinking Out of Order: When Output Order Stops Reflecting Reasoning Order in Diffusion Language Models

📅 2026-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of autoregressive language models: their fixed left-to-right generation order causes significant performance degradation in non-standard reasoning scenarios (such as answer-first settings, where the answer must be produced before the intermediate reasoning steps). To mitigate this issue, the authors turn to Masked Diffusion Language Models (MDLMs), which decouple output order from reasoning order by iteratively refining all tokens in parallel. A new benchmark, ReasonOrderQA, is introduced to evaluate model sensitivity to reasoning order under controllable difficulty levels. A multi-stage analysis of the diffusion process reveals that MDLMs stabilize simpler reasoning steps early in generation. Experimental results demonstrate that MDLMs exhibit strong order robustness, with performance drops of no more than 14% under answer-first conditions on GSM8K, Math500, and ReasonOrderQA, substantially outperforming autoregressive models, which suffer degradations of up to 67%.

📝 Abstract
Autoregressive (AR) language models enforce a fixed left-to-right generation order, creating a fundamental limitation when the required output structure conflicts with natural reasoning (e.g., producing answers before explanations due to presentation or schema constraints). In such cases, AR models must commit to answers before generating intermediate reasoning, and this rigid constraint forces premature commitment. Masked diffusion language models (MDLMs), which iteratively refine all tokens in parallel, offer a way to decouple computation order from output structure. We validate this capability on GSM8K, Math500, and ReasonOrderQA, a benchmark we introduce with controlled difficulty and order-level evaluation. When prompts request answers before reasoning, AR models exhibit large accuracy gaps compared to standard chain-of-thought ordering (up to 67% relative drop), while MDLMs remain stable ($\leq$14% relative drop), a property we term "order robustness". Using ReasonOrderQA, we present evidence that MDLMs achieve order robustness by stabilizing simpler tokens (e.g., reasoning steps) earlier in the diffusion process than complex ones (e.g., final answers), enabling reasoning tokens to stabilize before answer commitment. Finally, we identify failure conditions where this advantage weakens, outlining the conditions required for order robustness.
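The decoding dynamics the abstract describes can be sketched as a toy confidence-based unmasking loop. This is an illustrative simulation, not the paper's implementation: the "model" is a stand-in that assigns higher confidence to positions standing in for simple reasoning tokens than to the final-answer position, mirroring the claim that simpler tokens stabilize earlier in the diffusion process regardless of where they sit in the output.

```python
# Toy sketch of confidence-ordered parallel decoding in a masked diffusion LM.
# All names and scores here are illustrative assumptions, not the paper's code.

MASK = "[MASK]"

def toy_model(tokens, target, confidences):
    """Stand-in for the denoiser: return (prediction, confidence)
    for every still-masked position."""
    return {
        i: (target[i], confidences[i])
        for i, t in enumerate(tokens) if t == MASK
    }

def diffusion_decode(target, confidences, per_step=1):
    """Iteratively commit the highest-confidence masked positions
    until no masks remain; record the order of commitment."""
    tokens = [MASK] * len(target)
    unmask_order = []
    while MASK in tokens:
        scores = toy_model(tokens, target, confidences)
        chosen = sorted(scores, key=lambda i: scores[i][1],
                        reverse=True)[:per_step]
        for i in chosen:
            tokens[i] = scores[i][0]
            unmask_order.append(i)
    return tokens, unmask_order

# Positions 0-2 play the role of "reasoning" tokens (high confidence);
# position 3 plays the "answer" token (low confidence). Even in an
# answer-first layout, commitment order follows confidence, not position.
target = ["step1", "step2", "step3", "answer"]
confidences = [0.9, 0.8, 0.7, 0.2]
final, order = diffusion_decode(target, confidences)
```

Because positions are committed by confidence rather than left-to-right, the low-confidence "answer" slot is filled last even though an answer-first prompt would place it first in the output, which is the intuition behind the order robustness the paper reports.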
Problem

Research questions and friction points this paper is trying to address.

output order
reasoning order
order robustness
autoregressive language models
diffusion language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

masked diffusion language models
order robustness
non-autoregressive generation
reasoning order
diffusion process