Detail++: Training-Free Detail Enhancer for Text-to-Image Diffusion Models

📅 2025-07-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Text-to-image (T2I) models often suffer from detail misalignment and compositional distortion when generating images from prompts involving multiple subjects and complex attributes. To address this, we propose a **training-free progressive detail injection framework**: complex prompts are decomposed into sequential sub-prompts, and image generation proceeds in stages. Layout control is achieved via self-attention, while subject–attribute binding is enforced through cross-attention. Additionally, a test-time centroid alignment loss is introduced to refine the cross-attention mechanism, ensuring global compositional stability and precise fine-grained attribute binding. To our knowledge, this is the first method achieving multi-subject high-fidelity generation and consistent style control without any model training or fine-tuning. Extensive evaluation on T2I-CompBench and a newly constructed style-composition benchmark demonstrates significant improvements over state-of-the-art approaches—particularly in scenarios involving multi-object interactions and composite stylistic specifications.
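The core of the staged generation described above is decomposing a complex prompt into a sequence of cumulative sub-prompts, starting from a bare layout description and adding one detail per stage. The paper does not specify the decomposition mechanism in this summary (it may use an LLM or a syntactic parser); the following is a minimal toy sketch in which `progressive_subprompts` and its inputs are illustrative names, not the authors' implementation:

```python
def progressive_subprompts(base: str, details: list[str]) -> list[str]:
    """Build cumulative sub-prompts for staged generation.

    Stage 0 carries only the layout-level prompt; each later stage
    appends one attribute phrase, so generation can refine gradually.
    """
    prompts = [base]
    for detail in details:
        prompts.append(prompts[-1] + ", " + detail)
    return prompts


# Toy example: two subjects, one attribute phrase per stage.
stages = progressive_subprompts(
    "a cat and a dog in a park",
    ["the cat wearing a red hat", "the dog with a blue scarf"],
)
```

Each stage would then be denoised while reusing the self-attention maps of the previous stage to preserve the global composition, per the summary above.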

📝 Abstract
Recent advances in text-to-image (T2I) generation have led to impressive visual results. However, these models still face significant challenges when handling complex prompts, particularly those involving multiple subjects with distinct attributes. Inspired by the human drawing process, which first outlines the composition and then incrementally adds details, we propose Detail++, a training-free framework that introduces a novel Progressive Detail Injection (PDI) strategy to address this limitation. Specifically, we decompose a complex prompt into a sequence of simplified sub-prompts, guiding the generation process in stages. This staged generation leverages the inherent layout-controlling capacity of self-attention to first ensure global composition, followed by precise refinement. To achieve accurate binding between attributes and corresponding subjects, we exploit cross-attention mechanisms and further introduce a Centroid Alignment Loss at test time to reduce binding noise and enhance attribute consistency. Extensive experiments on T2I-CompBench and a newly constructed style-composition benchmark demonstrate that Detail++ significantly outperforms existing methods, particularly in scenarios involving multiple objects and complex stylistic conditions.
Problem

Research questions and friction points this paper is trying to address.

Enhancing detail generation in text-to-image diffusion models
Handling complex prompts with multiple distinct subjects
Improving attribute binding and consistency in generated images
Innovation

Methods, ideas, or system contributions that make the work stand out.

Progressive Detail Injection (PDI) strategy
Subject–attribute binding via cross-attention
Test-time Centroid Alignment Loss
Lifeng Chen
AGI Lab, Westlake University
Jiner Wang
AGI Lab, Westlake University
Zihao Pan
Meituan-M17 LongCat Team; Sun Yat-sen University
Research interests: Generative Modeling, Multimodal Large Language Models, Diffusion Models
Beier Zhu
Research Scientist, Nanyang Technological University
Research interests: Robust Machine Learning
Xiaofeng Yang
AGI Lab, Westlake University; Nanyang Technological University
Chi Zhang
AGI Lab, Westlake University