🤖 AI Summary
In text-to-visual generation, misalignment between user intent and generated outputs remains a persistent challenge, and a single generation attempt often fails to meet precise requirements. To address this, the authors propose PRIS (Prompt Redesign for Inference-time Scaling), a framework that jointly scales prompt optimization and visual generation. PRIS establishes a closed-loop refinement mechanism comprising feedback-driven analysis of the scaled generations, adaptive prompt revision, and an element-level factual-correction verifier. Crucially, this verifier enables fine-grained, interpretable alignment assessment. Evaluated on both text-to-image and text-to-video generation tasks, PRIS achieves substantial quality improvements, including a 15% average score gain on the VBench 2.0 benchmark. These results empirically validate the effectiveness and generalizability of co-optimizing prompts and generations at inference time.
📝 Abstract
Achieving precise alignment between user intent and generated visuals remains a central challenge in text-to-visual generation, as a single attempt often fails to produce the desired output. To handle this, prior approaches mainly scale the visual generation process (e.g., increasing sampling steps or seeds), but this quickly leads to a quality plateau. This limitation arises because the prompt, crucial for guiding generation, is kept fixed. To address this, we propose Prompt Redesign for Inference-time Scaling, coined PRIS, a framework that adaptively revises the prompt during inference in response to the scaled visual generations. The core idea of PRIS is to review the generated visuals, identify recurring failure patterns across visuals, and redesign the prompt accordingly before regenerating the visuals with the revised prompt. To provide precise alignment feedback for prompt revision, we introduce a new verifier, element-level factual correction, which evaluates the alignment between prompt attributes and generated visuals at a fine-grained level, achieving more accurate and interpretable assessments than holistic measures. Extensive experiments on both text-to-image and text-to-video benchmarks demonstrate the effectiveness of our approach, including a 15% gain on VBench 2.0. These results highlight that jointly scaling prompts and visuals is key to fully leveraging scaling laws at inference time. Visualizations are available at the website: https://subin-kim-cv.github.io/PRIS.
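The closed loop described above (generate at scale, verify element-level alignment, revise the prompt on recurring failures, regenerate) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: `generate`, `element_verifier`, and `revise_prompt` are toy stand-ins (a real system would call a text-to-image/video model and an LLM-based reviser), and the "visual" is modeled simply as the set of prompt elements it realized.

```python
"""Toy sketch of a PRIS-style inference-time loop.

All functions are illustrative placeholders: the generator always
misses one attribute ("red hat") until the revised prompt emphasizes
it, which lets the loop demonstrate failure detection and revision.
"""

from dataclasses import dataclass


@dataclass
class Visual:
    # Toy "visual": the set of prompt elements it actually realized.
    realized: frozenset


def generate(prompt_elements, seed):
    # Stand-in generator: consistently drops "red hat" unless the
    # revised prompt contains an emphasis marker for it.
    realized = set(prompt_elements)
    if "red hat" in realized and "emphasized:red hat" not in realized:
        realized.discard("red hat")
    return Visual(realized=frozenset(realized))


def element_verifier(prompt_elements, visual):
    # Element-level factual check: which prompt attributes are absent
    # from the generated visual? (Fine-grained, per-element feedback.)
    return set(prompt_elements) - visual.realized


def revise_prompt(prompt_elements, recurring_failures):
    # Toy revision: tag failing elements for emphasis. A real reviser
    # would rewrite the prompt text conditioned on the failure report.
    return set(prompt_elements) | {f"emphasized:{e}" for e in recurring_failures}


def pris_loop(prompt_elements, n_seeds=4, rounds=3):
    visuals = []
    for _ in range(rounds):
        # Scale generation: multiple samples under the current prompt.
        visuals = [generate(prompt_elements, s) for s in range(n_seeds)]
        # Aggregate per-element failures across the scaled samples.
        failure_counts = {}
        for v in visuals:
            for e in element_verifier(prompt_elements, v):
                failure_counts[e] = failure_counts.get(e, 0) + 1
        # Elements failing in a majority of samples are "recurring".
        recurring = {e for e, c in failure_counts.items() if c > n_seeds // 2}
        if not recurring:
            break  # prompt and visuals are aligned; stop revising
        prompt_elements = revise_prompt(prompt_elements, recurring)
    return prompt_elements, visuals


if __name__ == "__main__":
    prompt, visuals = pris_loop({"a cat", "red hat", "on a skateboard"})
    print(sorted(prompt))
    print(all("red hat" in v.realized for v in visuals))
```

In this toy run, round 1 finds "red hat" missing in every sample, the reviser adds an emphasis tag, and round 2 produces fully aligned visuals, so the loop stops early.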