🤖 AI Summary
Current unified multimodal large language models (MLLMs) face two key bottlenecks in text-to-image generation: (1) purely textual chain-of-thought (CoT) reasoning is coarse-grained and struggles with rare attribute compositions; (2) end-to-end generation lacks controllable, interpretable reasoning. To address these, we propose DraCo, the first framework to employ low-resolution visual sketches as a *draft-as-CoT*: a concrete, semantically driven chain of visual reasoning and iterative refinement. Our method integrates sketch preview generation, semantic-alignment self-checking, and selective super-resolution repair, augmented by DraCo-CFG, a novel guidance strategy for interleaved reasoning training. Evaluated on GenEval, Imagine-Bench, and GenEval++, DraCo significantly outperforms both direct generation and existing CoT-based approaches, with gains of 8%, 0.91 points, and 3%, respectively, demonstrating the efficacy of sketch-guided fine-grained control and robust generation of rare concept combinations.
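The draft-verify-refine loop summarized above can be sketched in a few lines. This is an illustrative mock only: `MockModel` and every method name (`generate_image`, `verify`, `edit`, `super_resolve`) are assumptions for exposition, not the paper's actual API.

```python
# Minimal runnable sketch of the draft-as-CoT loop. MockModel is a
# stand-in for a unified MLLM; all names here are hypothetical.

class MockModel:
    def generate_image(self, prompt, resolution):
        # Low-resolution draft acting as a concrete visual plan.
        return {"prompt": prompt, "resolution": resolution, "aligned": False}

    def verify(self, prompt, draft):
        # Self-check: report misalignments between draft and prompt.
        return [] if draft["aligned"] else ["attribute mismatch"]

    def edit(self, draft, instructions):
        # Selective correction of only the flagged regions.
        return {**draft, "aligned": True}

    def super_resolve(self, draft, resolution):
        # Upscale the verified draft into the final image.
        return {**draft, "resolution": resolution}

def draft_as_cot(model, prompt, max_rounds=3):
    draft = model.generate_image(prompt, resolution=256)
    for _ in range(max_rounds):
        issues = model.verify(prompt, draft)
        if not issues:
            break  # draft semantically matches the prompt
        draft = model.edit(draft, instructions=issues)
    return model.super_resolve(draft, resolution=1024)

final = draft_as_cot(MockModel(), "a purple giraffe riding a bicycle")
print(final["resolution"], final["aligned"])  # 1024 True
```

The key design point is that verification reuses the model's own understanding capability, so planning and checking stay inside one model rather than requiring an external critic.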
📝 Abstract
Recent unified multimodal large language models (MLLMs) have shown impressive capabilities, incorporating chain-of-thought (CoT) reasoning for enhanced text-to-image generation. However, existing approaches remain limited, either treating the model merely as a standalone generator or relying on abstract textual planning. To this end, we propose Draft-as-CoT (DraCo), a novel interleaved reasoning paradigm that fully leverages both textual and visual content in CoT for better planning and verification. Our method first generates a low-resolution draft image as a preview, providing more concrete and structured visual planning and guidance. We then employ the model's inherent understanding capability to verify potential semantic misalignments between the draft and the input prompt, and perform refinement through selective corrections with super-resolution. In this way, our approach addresses two fundamental challenges: the coarse-grained nature of textual planning and the difficulty of generating rare attribute combinations. To support training, we curate DraCo-240K, a dataset targeting three atomic capabilities: general correction, instance manipulation, and layout reorganization. Supported by DraCo-CFG, a specialized classifier-free guidance (CFG) strategy for interleaved reasoning, DraCo achieves substantial gains on GenEval (+8%), Imagine-Bench (+0.91), and GenEval++ (+3%), significantly outperforming both direct generation and other CoT-empowered generation methods.
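For readers unfamiliar with classifier-free guidance, the standard combination step it builds on is shown below. The abstract does not give the exact form of DraCo-CFG (its specialization for interleaved reasoning), so this sketch shows only the vanilla CFG baseline it extends.

```python
import numpy as np

def cfg_combine(cond_pred, uncond_pred, guidance_scale):
    # Vanilla classifier-free guidance: extrapolate from the
    # unconditional prediction toward the conditional one,
    #   e = e_uncond + s * (e_cond - e_uncond).
    # DraCo-CFG is a specialized variant of this for interleaved
    # reasoning; its exact form is not stated in the abstract.
    return uncond_pred + guidance_scale * (cond_pred - uncond_pred)

cond = np.array([1.0, 2.0])    # prediction conditioned on prompt/draft
uncond = np.array([0.0, 1.0])  # unconditional prediction
print(cfg_combine(cond, uncond, 3.0))  # [3. 4.]
```

With `guidance_scale = 1.0` the output reduces to the conditional prediction; larger scales push samples harder toward the conditioning signal at some cost in diversity.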