Unified Thinker: A General Reasoning Modular Core for Image Generation

📅 2026-01-06
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the disconnect between reasoning and execution in generative models on logic-intensive instruction-following tasks. To bridge this gap, the authors propose Unified Thinker, a task-agnostic modular architecture that decouples a dedicated reasoning module (the Thinker) from the image generator. This pluggable planning core guides the generation process through a structured planning interface and is trained in two stages: the interface is first constructed, then optimized via reinforcement learning with pixel-level feedback. The design enables independent upgrades of reasoning capabilities without retraining the entire system. Experiments demonstrate that the framework significantly improves logical consistency and generation quality in both text-to-image synthesis and image editing, effectively narrowing the performance gap between open-source models and proprietary systems.
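The decoupling described above can be sketched as a minimal pipeline. The paper specifies a "structured planning interface" between Thinker and Generator but this summary does not give its schema, so the `Plan` fields, function names, and toy components below are illustrative assumptions, not the authors' implementation:

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical structured plan emitted by the Thinker; the real interface
# schema is not described here, so these fields are assumptions.
@dataclass
class Plan:
    intent: str                                       # restated high-level instruction
    steps: List[str] = field(default_factory=list)    # grounded sub-goals
    constraints: List[str] = field(default_factory=list)  # verifiable checks

Image = str  # stand-in type for an image

def run_pipeline(instruction: str,
                 thinker: Callable[[str], Plan],
                 generator: Callable[[Plan], Image]) -> Image:
    """Decoupled pipeline: the Thinker plans, the Generator executes.
    Either component can be upgraded independently of the other."""
    plan = thinker(instruction)   # reasoning module produces a structured plan
    return generator(plan)        # generator is conditioned on the plan

# Toy stand-ins just to exercise the interface; a real system would plug in
# an LLM-based Thinker and a diffusion Generator here.
toy_thinker = lambda s: Plan(intent=s, steps=[f"render: {s}"],
                             constraints=["object count matches instruction"])
toy_generator = lambda p: f"<image for: {p.steps[0]}>"

print(run_pipeline("three red cubes on a table", toy_thinker, toy_generator))
# → <image for: render: three red cubes on a table>
```

The design choice the sketch highlights is that the only contract between the two modules is the `Plan` object, which is what lets the reasoning side be swapped or retrained without touching the generator.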

📝 Abstract
Despite impressive progress in high-fidelity image synthesis, generative models still struggle with logic-intensive instruction following, exposing a persistent reasoning–execution gap. Meanwhile, closed-source systems (e.g., Nano Banana) have demonstrated strong reasoning-driven image generation, highlighting a substantial gap with current open-source models. We argue that closing this gap requires not merely better visual generators, but executable reasoning: decomposing high-level intents into grounded, verifiable plans that directly steer the generative process. To this end, we propose Unified Thinker, a task-agnostic reasoning architecture for general image generation, designed as a unified planning core that can plug into diverse generators and workflows. Unified Thinker decouples a dedicated Thinker from the image Generator, enabling modular upgrades of reasoning without retraining the entire generative model. We further introduce a two-stage training paradigm: we first build a structured planning interface for the Thinker, then apply reinforcement learning to ground its policy in pixel-level feedback, encouraging plans that optimize visual correctness over textual plausibility.Extensive experiments on text-to-image generation and image editing show that Unified Thinker substantially improves image reasoning and generation quality.
Problem

Research questions and friction points this paper is trying to address.

reasoning–execution gap
logic-intensive instruction following
reasoning-driven image generation
open-source generative models
Innovation

Methods, ideas, or system contributions that make the work stand out.

executable reasoning
modular reasoning
reasoning-generation decoupling
reinforcement learning with pixel feedback
task-agnostic planning
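The "reinforcement learning with pixel feedback" idea can be illustrated with a toy REINFORCE loop: a policy chooses among candidate plans, and the reward is computed from the generated pixels rather than from the plan text. Everything below (the two hard-coded plans, the 4×4 "images", the reward definition) is an assumed toy setup, not the paper's training recipe:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy assumption: two candidate plans; plan 1 yields an image that satisfies
# the instruction, plan 0 does not. The "generator" just paints a 4x4 grid.
target = np.ones((4, 4))  # what a visually correct output looks like

def generate(plan_id: int) -> np.ndarray:
    return np.ones((4, 4)) if plan_id == 1 else np.zeros((4, 4))

def pixel_reward(img: np.ndarray) -> float:
    """Pixel-level feedback: fraction of pixels matching the target,
    i.e. the plan is scored on visual correctness, not textual plausibility."""
    return float((img == target).mean())

logits = np.zeros(2)  # policy over the two candidate plans
lr = 0.5
for _ in range(200):
    probs = np.exp(logits) / np.exp(logits).sum()
    a = int(rng.choice(2, p=probs))      # sample a plan from the policy
    r = pixel_reward(generate(a))        # ground the reward in pixels
    grad = -probs.copy()                 # REINFORCE: grad of log pi(a)
    grad[a] += 1.0
    logits += lr * r * grad              # reward-weighted policy update

probs = np.exp(logits) / np.exp(logits).sum()
print(f"P(correct plan) = {probs[1]:.2f}")
```

After training, the policy concentrates its probability mass on the plan whose rendered output matches the target, which is the core mechanism the paper's second training stage relies on, here reduced to a two-armed bandit.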