UReason: Benchmarking the Reasoning Paradox in Unified Multimodal Models

📅 2026-02-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the practical impact of reasoning processes on image generation within unified multimodal models and identifies a “reasoning paradox”: although reasoning-guided generation outperforms direct generation, retaining the intermediate reasoning steps as conditioning context degrades output quality, while conditioning on a distilled reasoning prompt yields the best results. This reveals contextual interference as a critical bottleneck. To systematically examine this phenomenon, the authors introduce UReason, a diagnostic benchmark comprising 2,000 samples across five fine-grained reasoning tasks, and employ chain-of-thought prompting, context-stripped generation, and a multimodal evaluation framework. Experiments across eight open-source unified models demonstrate that this paradox is widespread, indicating that current models are primarily constrained by contextual interference rather than insufficient reasoning capacity, and thereby suggesting a new direction for effectively integrating reasoning and generation.

📝 Abstract
To elicit capabilities for addressing complex and implicit visual requirements, recent unified multimodal models increasingly adopt chain-of-thought reasoning to guide image generation. However, the actual effect of reasoning on visual synthesis remains unclear. We present UReason, a diagnostic benchmark for reasoning-driven image generation that evaluates whether reasoning can be faithfully executed in pixels. UReason contains 2,000 instances across five task families: Code, Arithmetic, Spatial, Attribute, and Text reasoning. To isolate the role of reasoning traces, we introduce an evaluation framework comparing direct generation, reasoning-guided generation, and de-contextualized generation, which conditions only on the refined prompt. Across eight open-source unified models, we observe a consistent Reasoning Paradox: reasoning traces generally improve performance over direct generation, yet retaining intermediate thoughts as conditioning context often hinders visual synthesis, and conditioning only on the refined prompt yields substantial gains. Our analysis suggests that the bottleneck lies in contextual interference rather than insufficient reasoning capacity. UReason provides a principled testbed for studying reasoning in unified models and motivates future methods that effectively integrate reasoning into visual generation while mitigating interference.
Problem

Research questions and friction points this paper is trying to address.

reasoning paradox
multimodal models
image generation
chain-of-thought reasoning
visual synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

reasoning paradox
multimodal reasoning
image generation
contextual interference
diagnostic benchmark