Think-Then-Generate: Reasoning-Aware Text-to-Image Diffusion with LLM Encoders

📅 2026-01-15
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses a limitation of current text-to-image diffusion models: even those equipped with large language model (LLM) encoders use the LLM merely as a text encoder, neglecting its reasoning capabilities and often producing semantically inconsistent or commonsense-deficient outputs. To bridge this gap, the authors propose a “Think-then-Generate” (T2G) paradigm that uses lightweight supervised fine-tuning to unlock the LLM’s explicit reasoning and prompt-rewriting abilities, feeding its refined output to the diffusion model as conditioning. They further introduce a Dual-GRPO algorithm that jointly optimizes the LLM encoder and the diffusion backbone, complemented by an image-anchored reward mechanism that reinforces world-knowledge reasoning. The authors present this as the first integration of LLM-based reasoning into the image generation pipeline, shifting the model from literal text-pixel mapping to semantic understanding. The approach yields substantial improvements in factual consistency, semantic alignment, and visual realism across multiple reasoning-intensive generation and editing benchmarks, attaining a WISE score of 0.79, nearly on par with GPT-4.
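Dual-GRPO builds on GRPO-style group-relative advantage estimation, which scores each sampled rollout against its own group rather than a learned value network. A minimal sketch of that normalization step is below; the paper's dual LLM/diffusion objectives and the image-anchored reward are not reproduced here, and `group_relative_advantages` is an illustrative name, not the authors' code.

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize each rollout's reward by the
    mean and standard deviation of its sampled group, so no separate
    value network is needed."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    if sigma == 0.0:
        sigma = 1.0  # avoid division by zero when all rewards tie
    return [(r - mu) / sigma for r in rewards]

# Four rollouts for one prompt, scored by an (assumed) image-grounded reward:
advs = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Advantages sum to zero within the group: above-average rollouts are reinforced, below-average ones are suppressed, for both the LLM encoder and the diffusion backbone in the paper's joint setup.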

📝 Abstract
Recent progress in text-to-image (T2I) diffusion models (DMs) has enabled high-quality visual synthesis from diverse textual prompts. Yet, most existing T2I DMs, even those equipped with large language model (LLM)-based text encoders, remain text-pixel mappers -- they employ LLMs merely as text encoders, without leveraging their inherent reasoning capabilities to infer what should be visually depicted given the textual prompt. To move beyond such literal generation, we propose the think-then-generate (T2G) paradigm, where the LLM-based text encoder is encouraged to reason about and rewrite raw user prompts; the states of the rewritten prompts then serve as diffusion conditioning. To achieve this, we first activate the think-then-rewrite pattern of the LLM encoder with a lightweight supervised fine-tuning process. Subsequently, the LLM encoder and diffusion backbone are co-optimized to ensure faithful reasoning about the context and accurate rendering of the semantics via Dual-GRPO. In particular, the text encoder is reinforced using image-grounded rewards to infer and recall world knowledge, while the diffusion backbone is pushed to produce semantically consistent and visually coherent images. Experiments show substantial improvements in factual consistency, semantic alignment, and visual realism across reasoning-based image generation and editing benchmarks, achieving 0.79 on WISE score, nearly on par with GPT-4. Our results constitute a promising step toward next-generation unified models with reasoning, expression, and demonstration capacities.
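The abstract's pipeline (think, then rewrite, then condition the diffusion model on the states of the rewritten prompt rather than the raw one) can be sketched as follows. All names (`think_then_rewrite`, `encode_states`, `generate`) and the toy "encoder" are illustrative stand-ins for the SFT-activated LLM encoder and diffusion backbone, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RewriteResult:
    reasoning: str   # the LLM's explicit reasoning about the raw prompt
    rewritten: str   # the refined prompt whose states condition diffusion

def think_then_rewrite(user_prompt: str) -> RewriteResult:
    """Stand-in for the SFT-activated LLM encoder: reason about the raw
    prompt, then emit a rewritten, knowledge-enriched version."""
    reasoning = f"The prompt '{user_prompt}' implies unstated visual details."
    rewritten = user_prompt + ", rendered with physically plausible details"
    return RewriteResult(reasoning, rewritten)

def encode_states(text: str) -> List[float]:
    """Stand-in for extracting the LLM's hidden states over a prompt; a
    real system would return per-token hidden-state tensors."""
    return [float(ord(c) % 7) for c in text]

def generate(user_prompt: str) -> List[float]:
    """T2G flow: think -> rewrite -> condition on the rewritten prompt.
    A real diffusion backbone would iteratively denoise using this
    conditioning; here we return it to show what changes vs. baseline."""
    result = think_then_rewrite(user_prompt)
    return encode_states(result.rewritten)

cond_t2g = generate("a glass of water in freezing weather")
cond_raw = encode_states("a glass of water in freezing weather")
```

The point of the sketch is the control flow: the diffusion conditioning comes from the rewritten prompt, so `cond_t2g` differs from the literal raw-prompt conditioning `cond_raw`.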
Problem

Research questions and friction points this paper is trying to address.

text-to-image generation
reasoning-aware modeling
diffusion models
large language models
semantic alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Think-Then-Generate
reasoning-aware diffusion
LLM encoder
Dual-GRPO
prompt rewriting