🤖 AI Summary
Existing chain-of-thought (CoT) reasoning methods exhibit weak generalization and poor scene adaptability when multimodal large language models (MLLMs) are applied to visual generation tasks. To address this, we propose the first thought-driven universal visual generation framework. Our method decouples the MLLM from the diffusion Transformer (DiT): the MLLM interprets user intent to generate executable, customized instructions, while the DiT performs high-fidelity image synthesis conditioned on these instructions. We introduce a task-agnostic CoT reasoning paradigm designed specifically for visual generation. Furthermore, we propose a separable GRPO (Group Relative Policy Optimization) reinforcement learning mechanism that enables alternating optimization of the MLLM and DiT, facilitating cross-scenario joint training. Evaluated on multiple visual generation benchmarks, our framework achieves state-of-the-art performance, significantly improving complex intent understanding and open-domain generalization.
📝 Abstract
Recent progress in Multimodal Large Language Models (MLLMs) demonstrates that Chain-of-Thought (CoT) reasoning enables systematic solutions to complex understanding tasks. However, its extension to generation tasks remains nascent, limited by scenario-specific mechanisms that hinder generalization and adaptation. In this work, we present ThinkGen, the first think-driven visual generation framework that explicitly leverages an MLLM's CoT reasoning across diverse generation scenarios. ThinkGen employs a decoupled architecture comprising a pretrained MLLM and a Diffusion Transformer (DiT), wherein the MLLM generates tailored instructions based on user intent, and the DiT produces high-quality images guided by these instructions. We further propose a separable GRPO-based training paradigm (SepGRPO) that alternates reinforcement learning between the MLLM and DiT modules. This flexible design enables joint training across diverse datasets, facilitating effective CoT reasoning for a wide range of generative scenarios. Extensive experiments demonstrate that ThinkGen achieves robust, state-of-the-art performance across multiple generation benchmarks. Code is available at https://github.com/jiaosiyuu/ThinkGen
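To make the alternating structure of SepGRPO concrete, here is a minimal sketch of the training schedule it implies: one module receives GRPO updates while the other is held frozen, and the roles swap periodically. All function names (`mllm_grpo_step`, `dit_grpo_step`, `sep_grpo`) and the switching interval are illustrative assumptions, not the paper's actual implementation; the real GRPO reward computation and policy updates live in the linked repository.

```python
def mllm_grpo_step(batch):
    """Placeholder for one GRPO update on the MLLM (instruction generator),
    with the DiT frozen and acting as the renderer that produces images
    from which rewards are computed. (Hypothetical signature.)"""
    return {"module": "mllm", "batch": batch}

def dit_grpo_step(batch):
    """Placeholder for one GRPO update on the DiT (image synthesizer),
    with the MLLM frozen as a fixed source of instructions. (Hypothetical.)"""
    return {"module": "dit", "batch": batch}

def sep_grpo(batches, switch_every=2):
    """Alternate reinforcement learning between the two modules:
    train one while the other is frozen, swapping roles every
    `switch_every` batches. Returns the schedule of trained modules."""
    schedule = []
    train_mllm = True
    for i, batch in enumerate(batches):
        step = mllm_grpo_step(batch) if train_mllm else dit_grpo_step(batch)
        schedule.append(step["module"])
        if (i + 1) % switch_every == 0:
            train_mllm = not train_mllm  # swap which module is optimized
    return schedule

# Example: 6 batches with a role swap every 2 steps
print(sep_grpo(range(6), switch_every=2))
# → ['mllm', 'mllm', 'dit', 'dit', 'mllm', 'mllm']
```

Because each module trains against a frozen counterpart, the same loop can mix datasets from different generation scenarios without the two optimization objectives interfering within a single step.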