CoT-lized Diffusion: Let's Reinforce T2I Generation Step-by-step

📅 2025-07-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current text-to-image (T2I) models struggle to achieve precise spatial alignment between textual descriptions and generated images in complex scenes, primarily due to the decoupling of layout planning and image synthesis, which precludes dynamic, stepwise control. To address this, we propose a multimodal large language model (MLLM)-driven 3D layout-diffusion co-generation framework—the first to integrate chain-of-thought–style stepwise reasoning into T2I generation. Specifically, our method dynamically refines a 3D scene layout at each denoising step and tightly couples layout planning with image synthesis via semantic-depth joint conditioning and condition-aware attention mechanisms. Evaluated on a 3D scene benchmark, our approach improves spatial accuracy for complex scenes by 34.7% over the state of the art, demonstrating substantial gains in compositional fidelity and structural coherence.

📝 Abstract
Current text-to-image (T2I) generation models struggle to align spatial composition with the input text, especially in complex scenes. Even layout-based approaches yield suboptimal spatial control, as their generation process is decoupled from layout planning, making it difficult to refine the layout during synthesis. We present CoT-Diff, a framework that brings step-by-step CoT-style reasoning into T2I generation by tightly integrating Multimodal Large Language Model (MLLM)-driven 3D layout planning with the diffusion process. CoT-Diff enables layout-aware reasoning inline within a single diffusion round: at each denoising step, the MLLM evaluates intermediate predictions, dynamically updates the 3D scene layout, and continuously guides the generation process. The updated layout is converted into semantic conditions and depth maps, which are fused into the diffusion model via a condition-aware attention mechanism, enabling precise spatial control and semantic injection. Experiments on 3D Scene benchmarks show that CoT-Diff significantly improves spatial alignment and compositional fidelity, and outperforms the state-of-the-art method by 34.7% in complex scene spatial accuracy, thereby validating the effectiveness of this entangled generation paradigm.
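The abstract describes a loop in which, at every denoising step, an MLLM inspects the intermediate prediction, updates the 3D layout, and feeds semantic and depth conditions back into the diffusion model. A minimal runnable sketch of that control flow is below; all names, the toy list-based "latent", and the averaging stand-ins for the MLLM critique and the condition-aware attention fusion are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the CoT-Diff per-step loop described in the abstract.
# The real system uses an MLLM planner and a latent diffusion model;
# here both are replaced by toy numeric stand-ins.

def refine_layout(layout, x0_pred):
    # Stand-in for the MLLM critique: nudge each object's depth toward the
    # mean of the intermediate prediction (the paper's MLLM instead reasons
    # over the decoded image and edits the 3D scene layout).
    mean_d = sum(x0_pred) / len(x0_pred)
    return [(name, (d + mean_d) / 2) for name, d in layout]

def render_conditions(layout):
    # Convert the 3D layout into semantic conditions and a depth map
    # (here: a label list and a depth list, one entry per object).
    semantic = [name for name, _ in layout]
    depth = [d for _, d in layout]
    return semantic, depth

def cot_diff_sample(prompt, layout, num_steps=4):
    x = [0.5, 0.5]  # toy "latent"; the real model starts from Gaussian noise
    for t in reversed(range(num_steps)):
        x0_pred = x  # stand-in for the model's intermediate clean estimate
        layout = refine_layout(layout, x0_pred)       # inline layout update
        semantic, depth = render_conditions(layout)
        # Stand-in for condition-aware attention: pull the latent toward the
        # layout's depth values during the denoising step.
        x = [(xi + di) / 2 for xi, di in zip(x, depth)]
    return x, layout
```

The point of the sketch is the entangled structure: layout planning happens inside the sampling loop rather than once up front, so each denoising step sees a freshly refined layout.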
Problem

Research questions and friction points this paper is trying to address.

Improves spatial alignment in text-to-image generation
Integrates layout planning with diffusion process dynamically
Enhances complex scene accuracy via step-by-step reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates MLLM-driven 3D layout planning with diffusion
Dynamically updates 3D layout during denoising steps
Fuses layout into diffusion via condition-aware attention
Zheyuan Liu
School of Electrical and Computer Engineering, Peking University
Munan Ning
Peking University
Qihui Zhang
Peking University
Human Alignment · Multi-Modality · Large Language Model
Shuo Yang
School of Electrical and Computer Engineering, Peking University
Zhongrui Wang
School of Electrical and Computer Engineering, Peking University
Yiwei Yang
Shanghai Jiao Tong University
Xianzhe Xu
Hupan Lab
Yibing Song
Deputy Chief Engineer, BYD Group
Multi-Modal AI
Weihua Chen
Alibaba DAMO Academy, previously NLPR, CASIA
Computer Vision
Fan Wang
DAMO Academy, Alibaba Group
Li Yuan
Research Associate, University of Science & Technology of China (USTC)
Antibiotic resistance · Wastewater treatment · Environmental bioremediation · Anaerobic digestion · Fate of organic pollutants