No Concept Left Behind: Test-Time Optimization for Compositional Text-to-Image Generation

📅 2025-09-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing text-to-image (T2I) models often omit or misrepresent semantic elements of complex prompts, resulting in low generation fidelity. To address this, we propose a fine-grained test-time optimization framework: the prompt is first decomposed into atomic semantic concepts, and the generated image is then aligned with the text at both the global and concept level. Our key innovation is a concept-level alignment mechanism that uses a fine-grained CLIP variant to localize missing or erroneous concepts, combined with a large language model that iteratively refines the prompt. Crucially, our method requires no model fine-tuning. Evaluated on DrawBench and CompBench, it substantially outperforms strong baselines, achieving state-of-the-art performance in both concept coverage and human-evaluated fidelity.

📝 Abstract
Despite recent advances in text-to-image (T2I) models, they often fail to faithfully render all elements of complex prompts, frequently omitting or misrepresenting specific objects and attributes. Test-time optimization has emerged as a promising approach to address this limitation by refining generation without the need for retraining. In this paper, we propose a fine-grained test-time optimization framework that enhances compositional faithfulness in T2I generation. Unlike most prior approaches, which rely solely on a global image/text similarity score, our method decomposes the input prompt into semantic concepts and evaluates alignment at both the global and concept levels. A fine-grained variant of CLIP is used to compute concept-level correspondence, producing detailed feedback on missing or inaccurate concepts. This feedback is fed into an iterative prompt refinement loop, enabling a large language model to propose improved prompts. Experiments on DrawBench and CompBench prompts demonstrate that our method significantly improves concept coverage and human-judged faithfulness over both standard test-time optimization and the base T2I model. Code is available at: https://github.com/AmirMansurian/NoConceptLeftBehind
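The concept-level alignment step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the concept splitter is a toy heuristic (the paper performs semantic decomposition), and `concept_score` is a deterministic word-overlap stand-in for the fine-grained CLIP similarity; the "image" is mocked as a dict carrying a caption.

```python
def decompose_prompt(prompt: str) -> list[str]:
    """Split a prompt into atomic concepts.
    Toy heuristic (comma/'and' splits); the paper's decomposition is semantic."""
    parts = prompt.replace(" and ", ", ").split(",")
    return [p.strip() for p in parts if p.strip()]

def concept_score(image: dict, concept: str) -> float:
    """Stand-in for a fine-grained CLIP image-concept similarity.
    Here: 1.0 if every word of the concept appears in a mock caption."""
    words = image["caption"].lower().split()
    return 1.0 if all(w in words for w in concept.lower().split()) else 0.0

def missing_concepts(image: dict, prompt: str, threshold: float = 0.5) -> list[str]:
    """Return the concepts whose alignment score falls below the threshold;
    this is the fine-grained feedback fed to the prompt refiner."""
    return [c for c in decompose_prompt(prompt)
            if concept_score(image, c) < threshold]

# Toy example: the "image" is a dict whose caption stands in for pixels.
image = {"caption": "a red cube on a table"}
prompt = "a red cube, a blue sphere and a green cone"
print(missing_concepts(image, prompt))  # → ['a blue sphere', 'a green cone']
```

In the real pipeline, the per-concept scores come from a fine-grained CLIP variant applied to the generated image, and the list of low-scoring concepts is what the refinement loop acts on.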
Problem

Research questions and friction points this paper is trying to address.

Addresses compositional faithfulness in text-to-image generation models
Improves concept coverage through fine-grained test-time optimization
Reduces omission and misrepresentation of objects in complex prompts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-grained test-time optimization for compositional T2I generation
Decomposes prompts into concepts for multi-level alignment evaluation
Iterative prompt refinement using concept-level feedback from CLIP
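The iterative refinement described in these bullets can be sketched as a generic test-time optimization loop. All three callables are hypothetical stand-ins supplied by the caller: `generate` for the T2I model, `score_concepts` for the fine-grained CLIP scorer, and `refine` for the LLM that proposes improved prompts; the toy stubs below only exist to make the sketch runnable.

```python
def refine_loop(prompt, generate, score_concepts, refine, max_iters=5):
    """Test-time optimization: keep the best image under the concept-coverage
    score, refining the prompt whenever concepts are missing. No model
    weights are updated; only the prompt changes between iterations."""
    best_image, best_score = None, -1.0
    current = prompt
    for _ in range(max_iters):
        image = generate(current)
        missing, coverage = score_concepts(image, prompt)  # score vs. ORIGINAL prompt
        if coverage > best_score:
            best_image, best_score = image, coverage
        if not missing:                      # every concept rendered: stop early
            break
        current = refine(current, missing)   # LLM proposes an improved prompt
    return best_image, best_score

# Toy stubs: the "image" is just the set of concepts it managed to render.
concepts = ["red cube", "blue sphere"]
def generate(p):
    return {c for c in concepts if c in p}
def score_concepts(img, original):
    missing = [c for c in concepts if c not in img]
    return missing, 1 - len(missing) / len(concepts)
def refine(p, missing):
    return p + ", " + ", ".join(missing)     # naively re-emphasize missing concepts

img, score = refine_loop("red cube", generate, score_concepts, refine)
print(img, score)  # → {'red cube', 'blue sphere'} 1.0
```

Scoring each candidate against the original prompt (not the refined one) is what keeps the loop honest: the refiner may reword freely, but coverage is always measured against what the user asked for.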