Progressive Prompt Detailing for Improved Alignment in Text-to-Image Generative Models

📅 2025-03-22
🤖 AI Summary
Text-to-image generation models often align poorly with the text when given long prompts that involve complex scenes, multiple objects, and intricate spatial relationships. To address this, we propose SCoPE, a training-free method that decomposes a long prompt top-down into a sequence of coarse-to-fine sub-prompts and interpolates between their embeddings during Stable Diffusion inference, so that detail is introduced progressively. The key contribution is presented as the first training-free hierarchical prompt scheduling mechanism, Coarse-to-Fine Prompt Scheduling, which models semantic refinement explicitly through prompt decomposition and scheduled embedding interpolation. On GenAI-Bench, SCoPE achieves an average improvement of up to +4% in VQA scores over the Stable Diffusion baselines on 85% of the prompts, indicating better fine-grained semantic alignment. The method is plug-and-play and is described as compatible with mainstream diffusion models.

📝 Abstract
Text-to-image generative models often struggle with long prompts that detail complex scenes and diverse objects with distinct visual characteristics and spatial relationships. In this work, we propose SCoPE (Scheduled interpolation of Coarse-to-fine Prompt Embeddings), a training-free method that improves text-to-image alignment by progressively refining the input prompt in a coarse-to-fine manner. Given a detailed input prompt, we first decompose it into multiple sub-prompts that evolve from describing the broad scene layout to highly intricate details. During inference, we interpolate between these sub-prompts and thus progressively introduce finer-grained details into the generated image. Our training-free, plug-and-play approach significantly enhances prompt alignment, achieving an average improvement of up to +4% in Visual Question Answering (VQA) scores over the Stable Diffusion baselines on 85% of the prompts from the GenAI-Bench dataset.
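
The interpolation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the schedule, so the piecewise-linear schedule below, the function name `scheduled_embedding`, and the use of flat lists of floats in place of real text-encoder tensors are all assumptions made for illustration.

```python
import math

def scheduled_embedding(sub_prompt_embeddings, step, num_steps):
    """Blend coarse-to-fine sub-prompt embeddings for one denoising step.

    Assumption: a piecewise-linear schedule that starts at the coarsest
    embedding (step 0) and ends at the finest one (last step). The schedule
    used in the paper may differ.
    """
    if not sub_prompt_embeddings:
        raise ValueError("at least one sub-prompt embedding is required")
    if not 0 <= step < num_steps:
        raise ValueError("step must satisfy 0 <= step < num_steps")

    count = len(sub_prompt_embeddings)
    if count == 1:
        return list(sub_prompt_embeddings[0])

    # Map the denoising step to a position along the sub-prompt sequence.
    progress = step / max(num_steps - 1, 1)          # in [0, 1]
    position = progress * (count - 1)                # in [0, count - 1]
    lower = min(int(math.floor(position)), count - 2)
    weight = position - lower                        # in [0, 1]

    coarse = sub_prompt_embeddings[lower]
    fine = sub_prompt_embeddings[lower + 1]
    # Linear interpolation between the two neighbouring sub-prompts.
    return [(1.0 - weight) * c + weight * f for c, f in zip(coarse, fine)]


# Hypothetical toy embeddings for three sub-prompts, coarse to fine.
toy_embeddings = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
schedule = [scheduled_embedding(toy_embeddings, s, 5) for s in range(5)]
```

In a real pipeline the blended embedding would be passed to the denoiser as the text conditioning at each step; how that hook is wired depends on the diffusion library and is left out here.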
Problem

Research questions and friction points this paper is trying to address.

How to improve alignment in text-to-image models for long prompts describing complex scenes
How to introduce prompt details progressively, from coarse to fine
How to enhance prompt alignment without additional training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Progressive refinement of input prompts
Coarse-to-fine prompt decomposition
Training-free interpolation during inference