🤖 AI Summary
Current unified multimodal large language models (MLLMs) face two key bottlenecks in text-to-image generation: (1) purely textual chain-of-thought (CoT) reasoning is coarse-grained and struggles with rare attribute compositions; (2) end-to-end generation lacks controllable, interpretable reasoning. To address these, we propose DraCo, the first framework to employ low-resolution visual sketches as a *draft-as-CoT*: a concrete, semantically driven chain of visual reasoning and iterative refinement. Our method integrates sketch preview generation, semantic-alignment self-checking, and selective super-resolution repair, augmented by DraCo-CFG, a novel guidance strategy for interleaved reasoning training. Evaluated on GenEval, Imagine-Bench, and GenEval++, DraCo significantly outperforms both direct generation and existing CoT-based approaches, with gains of 8%, 0.91 points, and 3%, respectively, demonstrating the efficacy of sketch-guided fine-grained control and robust generation of rare concept combinations.
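The draft-verify-refine loop summarized above can be sketched in a few lines. This is an illustrative mock only: `MockModel` and every method name (`generate_image`, `verify`, `edit`, `super_resolve`) are assumptions for exposition, not the paper's actual API.

```python
# Minimal runnable sketch of the draft-as-CoT loop. MockModel is a
# stand-in for a unified MLLM; all names here are hypothetical.

class MockModel:
    def generate_image(self, prompt, resolution):
        # Low-resolution draft acting as a concrete visual plan.
        return {"prompt": prompt, "resolution": resolution, "aligned": False}

    def verify(self, prompt, draft):
        # Self-check: report misalignments between draft and prompt.
        return [] if draft["aligned"] else ["attribute mismatch"]

    def edit(self, draft, instructions):
        # Selective correction of only the flagged regions.
        return {**draft, "aligned": True}

    def super_resolve(self, draft, resolution):
        # Upscale the verified draft into the final image.
        return {**draft, "resolution": resolution}

def draft_as_cot(model, prompt, max_rounds=3):
    draft = model.generate_image(prompt, resolution=256)
    for _ in range(max_rounds):
        issues = model.verify(prompt, draft)
        if not issues:
            break  # draft semantically matches the prompt
        draft = model.edit(draft, instructions=issues)
    return model.super_resolve(draft, resolution=1024)

final = draft_as_cot(MockModel(), "a purple giraffe riding a bicycle")
print(final["resolution"], final["aligned"])  # 1024 True
```

The key design point is that verification reuses the model's own understanding capability, so planning and checking stay inside one model rather than requiring an external critic.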
📝 Abstract
Recent unified multimodal large language models (MLLMs) have shown impressive capabilities, incorporating chain-of-thought (CoT) reasoning for enhanced text-to-image generation. However, existing approaches remain limited, either treating the model merely as a standalone generator or relying on abstract textual planning. To this end, we propose Draft-as-CoT (DraCo), a novel interleaved reasoning paradigm that fully leverages both textual and visual content in CoT for better planning and verification. Our method first generates a low-resolution draft image as a preview, providing more concrete and structured visual planning and guidance. We then employ the model's inherent understanding capability to verify potential semantic misalignments between the draft and the input prompt, and perform refinement through selective corrections with super-resolution. In this way, our approach addresses two fundamental challenges: the coarse-grained nature of textual planning and the difficulty of generating rare attribute combinations. To support training, we curate DraCo-240K, a dataset targeting three atomic capabilities: general correction, instance manipulation, and layout reorganization. Supported by DraCo-CFG, a specialized classifier-free guidance (CFG) strategy for interleaved reasoning, DraCo achieves substantial gains on GenEval (+8%), Imagine-Bench (+0.91), and GenEval++ (+3%), significantly outperforming both direct generation and other CoT-empowered generation methods.
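For readers unfamiliar with classifier-free guidance, the standard combination step it builds on is shown below. The abstract does not give the exact form of DraCo-CFG (its specialization for interleaved reasoning), so this sketch shows only the vanilla CFG baseline it extends.

```python
import numpy as np

def cfg_combine(cond_pred, uncond_pred, guidance_scale):
    # Vanilla classifier-free guidance: extrapolate from the
    # unconditional prediction toward the conditional one,
    #   e = e_uncond + s * (e_cond - e_uncond).
    # DraCo-CFG is a specialized variant of this for interleaved
    # reasoning; its exact form is not stated in the abstract.
    return uncond_pred + guidance_scale * (cond_pred - uncond_pred)

cond = np.array([1.0, 2.0])    # prediction conditioned on prompt/draft
uncond = np.array([0.0, 1.0])  # unconditional prediction
print(cfg_combine(cond, uncond, 3.0))  # [3. 4.]
```

With `guidance_scale = 1.0` the output reduces to the conditional prediction; larger scales push samples harder toward the conditioning signal at some cost in diversity.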