🤖 AI Summary
Existing chain-of-thought (CoT) reasoning methods exhibit weak generalization and poor scene adaptability when multimodal large language models (MLLMs) are applied to visual generation tasks. To address this, we propose the first thought-driven universal visual generation framework. Our method decouples the MLLM from the diffusion Transformer (DiT): the MLLM interprets user intent to generate executable, customized instructions, while the DiT performs high-fidelity image synthesis conditioned on these instructions. We introduce a task-agnostic CoT reasoning paradigm designed specifically for visual generation. Furthermore, we propose a separable GRPO (Group Relative Policy Optimization) reinforcement learning mechanism that enables alternating optimization of the MLLM and DiT, facilitating cross-scenario joint training. Evaluated on multiple visual generation benchmarks, our framework achieves state-of-the-art performance, significantly improving complex intent understanding and open-domain generalization.
📝 Abstract
Recent progress in Multimodal Large Language Models (MLLMs) demonstrates that Chain-of-Thought (CoT) reasoning enables systematic solutions to complex understanding tasks. However, its extension to generation tasks remains nascent, limited by scenario-specific mechanisms that hinder generalization and adaptation. In this work, we present ThinkGen, the first think-driven visual generation framework that explicitly leverages an MLLM's CoT reasoning across diverse generation scenarios. ThinkGen employs a decoupled architecture comprising a pretrained MLLM and a Diffusion Transformer (DiT), wherein the MLLM generates tailored instructions based on user intent, and the DiT produces high-quality images guided by these instructions. We further propose a separable GRPO-based training paradigm (SepGRPO) that alternates reinforcement learning between the MLLM and DiT modules. This flexible design enables joint training across diverse datasets, facilitating effective CoT reasoning for a wide range of generative scenarios. Extensive experiments demonstrate that ThinkGen achieves robust, state-of-the-art performance across multiple generation benchmarks. Code is available at https://github.com/jiaosiyuu/ThinkGen
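To make the alternating structure of SepGRPO concrete, here is a minimal sketch of the training schedule it implies: one module receives GRPO updates while the other is held frozen, and the roles swap periodically. All function names (`mllm_grpo_step`, `dit_grpo_step`, `sep_grpo`) and the switching interval are illustrative assumptions, not the paper's actual implementation; the real GRPO reward computation and policy updates live in the linked repository.

```python
def mllm_grpo_step(batch):
    """Placeholder for one GRPO update on the MLLM (instruction generator),
    with the DiT frozen and acting as the renderer that produces images
    from which rewards are computed. (Hypothetical signature.)"""
    return {"module": "mllm", "batch": batch}

def dit_grpo_step(batch):
    """Placeholder for one GRPO update on the DiT (image synthesizer),
    with the MLLM frozen as a fixed source of instructions. (Hypothetical.)"""
    return {"module": "dit", "batch": batch}

def sep_grpo(batches, switch_every=2):
    """Alternate reinforcement learning between the two modules:
    train one while the other is frozen, swapping roles every
    `switch_every` batches. Returns the schedule of trained modules."""
    schedule = []
    train_mllm = True
    for i, batch in enumerate(batches):
        step = mllm_grpo_step(batch) if train_mllm else dit_grpo_step(batch)
        schedule.append(step["module"])
        if (i + 1) % switch_every == 0:
            train_mllm = not train_mllm  # swap which module is optimized
    return schedule

# Example: 6 batches with a role swap every 2 steps
print(sep_grpo(range(6), switch_every=2))
# → ['mllm', 'mllm', 'dit', 'dit', 'mllm', 'mllm']
```

Because each module trains against a frozen counterpart, the same loop can mix datasets from different generation scenarios without the two optimization objectives interfering within a single step.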