EPIC: Efficient Predicate-Guided Inference-Time Control for Compositional Text-to-Image Generation

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the poor adherence of current text-to-image generation models to complex compositional prompts involving multiple objects, counts, attributes, and spatial or semantic relations. The authors propose EPIC, a training-free inference-time refinement framework that parses the input prompt once into a fixed visual program of object variables and typed predicates, then verifies each generated or edited image against that program using visual evidence extracted from the image. Failed predicates decide the next step: local failures are routed to targeted editing and global failures to resampling. By combining prompt parsing, visual predicate verification, and sparing use of multimodal large language models (MLLMs), the approach raises prompt-level accuracy on GenEval2 from 34.16% for single-pass generation to 71.46%. Under the same generator/editor setting and maximum image-model execution budget, it surpasses the strongest prior refinement baseline by 19.23 percentage points while reducing realized cost by 31% in image-model executions, 72% in MLLM calls, and 81% in MLLM tokens per prompt.
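To put the headline figures in perspective, the arithmetic below works out the absolute and relative accuracy gain over single-pass generation, and the fraction of each cost that remains after the stated reductions. It uses only the numbers quoted above and assumes the cost reductions are measured against the strongest refinement baseline.

```latex
\begin{align*}
\text{absolute gain} &= 71.46 - 34.16 = 37.30 \text{ percentage points} \\
\text{relative gain} &= \frac{71.46}{34.16} \approx 2.09\times \\
\text{image-model executions remaining} &= 1 - 0.31 = 0.69 \\
\text{MLLM calls remaining} &= 1 - 0.72 = 0.28 \\
\text{MLLM tokens remaining} &= 1 - 0.81 = 0.19
\end{align*}
```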

📝 Abstract

Recent text-to-image (T2I) generators can synthesize realistic images, but still struggle with compositional prompts involving multiple objects, counts, attributes, and relations. We introduce EPIC (Efficient Predicate-Guided Inference-Time Control), a training-free inference-time refinement framework for compositional T2I generation. EPIC casts refinement as predicate-guided search: it parses the original prompt once into a fixed visual program of object variables and typed predicates, covering checkable conditions such as object presence, counts, attributes, and relations. Each generated or edited image is verified against this program using visual evidence extracted from that image. An image is judged to satisfy the prompt only when all predicates are satisfied; otherwise, failed predicates decide the next step, routing local failures to targeted editing and global failures to resampling while the fixed visual program remains unchanged. On GenEval2, EPIC improves prompt-level accuracy from 34.16% for single-pass generation with the base generator to 71.46%. Under the same generator/editor setting and maximum image-model execution budget, EPIC outperforms the strongest prior refinement baseline by 19.23 points while reducing realized cost by 31% in image-model executions, 72% in MLLM calls, and 81% in MLLM tokens per prompt.
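The control loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Predicate` class, the `refine` function, the `"local"`/`"global"` scope labels, and the toy generator, editor, and evidence extractor are all hypothetical names introduced here, and images are stood in for by plain dictionaries of already-extracted evidence. Which failures count as local or global is likewise an assumption, since the abstract does not specify it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Evidence = Dict[str, object]  # stand-in for visual evidence extracted from an image


@dataclass(frozen=True)
class Predicate:
    """One checkable condition in the fixed visual program (hypothetical structure)."""
    name: str
    scope: str  # "local" -> targeted edit, "global" -> resample (assumed labels)
    check: Callable[[Evidence], bool]


def refine(
    program: List[Predicate],
    generate: Callable[[], Evidence],
    edit: Callable[[Evidence, List[Predicate]], Evidence],
    extract: Callable[[Evidence], Evidence],
    budget: int,
) -> Tuple[Evidence, bool, int]:
    """Search until every predicate holds or the image-model budget is spent."""
    image = generate()
    used = 1  # image-model executions so far
    while True:
        evidence = extract(image)
        failed = [p for p in program if not p.check(evidence)]
        if not failed:
            return image, True, used   # all predicates satisfied
        if used >= budget:
            return image, False, used  # budget exhausted
        if any(p.scope == "global" for p in failed):
            image = generate()           # global failure: resample
        else:
            image = edit(image, failed)  # local failures only: targeted edit
        used += 1


# Toy setup for the prompt "two red apples and a bowl"; the program never changes.
program = [
    Predicate("bowl present", "global", lambda e: bool(e.get("bowl"))),
    Predicate("two apples", "local", lambda e: e.get("apples") == 2),
    Predicate("apples are red", "local", lambda e: e.get("apple_color") == "red"),
]

drafts = iter([
    {"apples": 2, "apple_color": "red"},                  # bowl missing
    {"bowl": True, "apples": 2, "apple_color": "green"},  # wrong color
])


def toy_generate() -> Evidence:
    return dict(next(drafts))


def toy_extract(image: Evidence) -> Evidence:
    return image  # the toy "image" already is its evidence


def toy_edit(image: Evidence, failed: List[Predicate]) -> Evidence:
    fixed = dict(image)
    for p in failed:
        if p.name == "apples are red":
            fixed["apple_color"] = "red"
        elif p.name == "two apples":
            fixed["apples"] = 2
    return fixed


image, ok, used = refine(program, toy_generate, toy_edit, toy_extract, budget=4)
print(ok, used, image)
```

In this toy run the first draft fails a global predicate and is resampled, the second fails only a local predicate and is edited, and the edited result passes, so the loop stops after three image-model executions out of a budget of four. Stopping as soon as all predicates hold is one way a method could realize lower cost than its maximum budget.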

Problem

Research questions and friction points this paper is trying to address.

compositional text-to-image generation

predicate-guided control

inference-time refinement

visual program verification

multi-object image synthesis

Innovation

Methods, ideas, or system contributions that make the work stand out.

predicate-guided search

compositional text-to-image generation

inference-time refinement