ThinkGen: Generalized Thinking for Visual Generation

📅 2025-12-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing chain-of-thought (CoT) reasoning methods exhibit weak generalization and poor scene adaptability when applied to visual generation tasks with multimodal large language models (MLLMs). To address this, the authors propose the first thought-driven universal visual generation framework. The method decouples the MLLM from the diffusion Transformer (DiT): the MLLM interprets user intent to generate executable, customized instructions, while the DiT performs high-fidelity image synthesis conditioned on these instructions. The paper introduces a task-agnostic CoT reasoning paradigm designed specifically for visual generation, along with a separable GRPO (Group Relative Policy Optimization) reinforcement learning mechanism that alternately optimizes the MLLM and DiT, enabling joint training across scenarios. Evaluated on multiple visual generation benchmarks, the framework achieves state-of-the-art performance, significantly improving complex intent understanding and open-domain generalization.
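The decoupled flow described above can be sketched as a two-stage pipeline. This is a minimal illustration only: the class and method names (`StubMLLM.think`, `StubDiT.generate`, `thinkgen_pipeline`) are assumptions for exposition, not identifiers from the paper's released code.

```python
# Hypothetical sketch of ThinkGen's decoupled inference flow.
# Stage 1: the MLLM turns raw user intent into an executable instruction
#          (in the real system this is produced via CoT reasoning).
# Stage 2: the DiT synthesizes an image conditioned only on that instruction.

class StubMLLM:
    """Stands in for the pretrained multimodal LLM."""

    def think(self, user_intent: str) -> str:
        # A real model would emit chain-of-thought tokens and distill them
        # into a tailored instruction; here we just tag the intent.
        return f"instruction({user_intent})"


class StubDiT:
    """Stands in for the diffusion Transformer image generator."""

    def generate(self, instruction: str) -> str:
        # A real model would run iterative denoising; we return a placeholder.
        return f"image<{instruction}>"


def thinkgen_pipeline(user_intent: str, mllm: StubMLLM, dit: StubDiT) -> str:
    instruction = mllm.think(user_intent)  # intent -> customized instruction
    return dit.generate(instruction)       # instruction -> synthesized image


result = thinkgen_pipeline("a red cube on a blue table", StubMLLM(), StubDiT())
```

The point of the decoupling is that each stage can be trained and swapped independently: the MLLM only ever sees text-level intent, and the DiT only ever sees the refined instruction.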

📝 Abstract
Recent progress in Multimodal Large Language Models (MLLMs) demonstrates that Chain-of-Thought (CoT) reasoning enables systematic solutions to complex understanding tasks. However, its extension to generation tasks remains nascent and limited by scenario-specific mechanisms that hinder generalization and adaptation. In this work, we present ThinkGen, the first think-driven visual generation framework that explicitly leverages MLLM's CoT reasoning in various generation scenarios. ThinkGen employs a decoupled architecture comprising a pretrained MLLM and a Diffusion Transformer (DiT), wherein the MLLM generates tailored instructions based on user intent, and DiT produces high-quality images guided by these instructions. We further propose a separable GRPO-based training paradigm (SepGRPO), alternating reinforcement learning between the MLLM and DiT modules. This flexible design enables joint training across diverse datasets, facilitating effective CoT reasoning for a wide range of generative scenarios. Extensive experiments demonstrate that ThinkGen achieves robust, state-of-the-art performance across multiple generation benchmarks. Code is available: https://github.com/jiaosiyuu/ThinkGen
Problem

Research questions and friction points this paper is trying to address.

Extends Chain-of-Thought reasoning to visual generation tasks
Overcomes scenario-specific limitations for better generalization and adaptation
Proposes a framework for joint training across diverse generative datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decoupled MLLM and DiT architecture for visual generation
SepGRPO training paradigm alternating reinforcement learning
Generalized CoT reasoning across diverse generative scenarios
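The SepGRPO idea, as summarized above, alternates reinforcement-learning updates between the two modules while using GRPO-style group-relative advantages. The sketch below is an assumption-laden illustration of that structure (the function names and the strict even/odd alternation are hypothetical, inferred from the abstract rather than taken from the released code).

```python
# Hedged sketch of a SepGRPO-style schedule: each phase freezes one module
# and applies a GRPO update to the other, using rewards normalized within
# each sampled group (the "group-relative" part of GRPO).

import statistics


def group_relative_advantages(rewards):
    """GRPO-style advantage: each reward normalized within its group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]


def sep_grpo_schedule(num_rounds):
    """Yield which module is trainable in each alternating phase."""
    for step in range(num_rounds):
        yield "mllm" if step % 2 == 0 else "dit"


# Example: advantages for one group of 4 sampled generations,
# and a 4-round alternating schedule.
adv = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
schedule = list(sep_grpo_schedule(4))
```

The design motivation in the summary is that keeping the two policy updates separable lets heterogeneous datasets (understanding-heavy for the MLLM, image-quality-heavy for the DiT) be mixed in one joint training run.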