RAISE: Requirement-Adaptive Evolutionary Refinement for Training-Free Text-to-Image Alignment

📅 2026-02-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current text-to-image models struggle to align faithfully with complex prompts, and existing inference-time methods use fixed budgets that neither adapt to prompt difficulty nor generalize well. This work proposes a training-free, evolutionary inference framework that evaluates each generated image against a structured checklist of requirements and adaptively composes refinement operations (prompt rewriting, noise resampling, and instruction editing) to improve outputs iteratively. The result is requirement-driven, adaptive test-time scaling of computation, with state-of-the-art alignment on GenEval (0.94 overall) and DrawBench. It is also more efficient than prior scaling and reflection-tuned baselines, generating 30–40% fewer samples and making 80% fewer vision-language model (VLM) calls.

📝 Abstract

Recent text-to-image (T2I) diffusion models achieve remarkable realism, yet faithful prompt-image alignment remains challenging, particularly for complex prompts with multiple objects, relations, and fine-grained attributes. Existing training-free inference-time scaling methods rely on fixed iteration budgets that cannot adapt to prompt difficulty, while reflection-tuned models require carefully curated reflection datasets and extensive joint fine-tuning of diffusion and vision-language models, often overfitting to reflection-path data and lacking transferability across models. We introduce RAISE (Requirement-Adaptive Self-Improving Evolution), a training-free, requirement-driven evolutionary framework for adaptive T2I generation. RAISE formulates image generation as a requirement-driven adaptive scaling process, evolving a population of candidates at inference time through a diverse set of refinement actions, including prompt rewriting, noise resampling, and instructional editing. Each generation is verified against a structured checklist of requirements, enabling the system to dynamically identify unsatisfied items and allocate further computation only where needed. This achieves adaptive test-time scaling that aligns computational effort with semantic query complexity. On GenEval and DrawBench, RAISE attains state-of-the-art alignment (0.94 overall on GenEval) while using fewer generated samples (30-40% fewer) and fewer VLM calls (80% fewer) than prior scaling and reflection-tuned baselines, demonstrating efficient, generalizable, and model-agnostic multi-round self-improvement. Code is available at https://github.com/LiyaoJiang1998/RAISE.
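
The loop described in the abstract (evolve candidates, verify against a checklist, spend more compute only on unmet items) can be illustrated with a toy sketch. This is not the authors' implementation: `Candidate`, `refine`, `evolve`, the 0.5 success probability, and the population size are all hypothetical, and the image generator and VLM verifier are replaced by a random stub so the example runs without any model.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    prompt: str
    seed: int
    satisfied: frozenset  # checklist items the stub verifier marked as met


# The three refinement actions named in the abstract.
ACTIONS = ("rewrite_prompt", "resample_noise", "instruction_edit")


def refine(parent, action, unmet, rng):
    """Toy stand-in for 'generate + verify': fixes one unmet item with p = 0.5 (assumed)."""
    fixed = {rng.choice(sorted(unmet))} if unmet and rng.random() < 0.5 else set()
    prompt = parent.prompt + " (rewritten)" if action == "rewrite_prompt" else parent.prompt
    seed = rng.randrange(2**31) if action == "resample_noise" else parent.seed
    return Candidate(prompt, seed, parent.satisfied | fixed)


def evolve(prompt, checklist, initially_met=(), pop_size=3, max_rounds=10, rng_seed=0):
    """Return (best candidate, rounds used, samples generated)."""
    rng = random.Random(rng_seed)
    required = frozenset(checklist)
    population = [Candidate(prompt, rng.randrange(2**31), frozenset(initially_met))]
    samples = 1
    for round_idx in range(max_rounds + 1):
        best = max(population, key=lambda c: len(c.satisfied))
        unmet = required - best.satisfied
        # Adaptive stopping: no further compute once every requirement is met.
        if not unmet or round_idx == max_rounds:
            return best, round_idx, samples
        children = [refine(best, rng.choice(ACTIONS), unmet, rng) for _ in range(pop_size)]
        samples += len(children)
        # Keep the candidates that satisfy the most checklist items.
        population = sorted(population + children,
                            key=lambda c: len(c.satisfied), reverse=True)[:pop_size]


if __name__ == "__main__":
    easy = evolve("a red cube", ["red cube"], initially_met=["red cube"])
    hard = evolve("two dogs left of a cat", ["two dogs", "one cat", "dogs left of cat"])
    print("easy:", easy[1], "rounds,", easy[2], "samples")
    print("hard:", hard[1], "rounds,", hard[2], "samples")
```

In this sketch a prompt whose checklist is already satisfied stops after the initial sample, while a prompt with several unmet items keeps refining, which is the sense in which compute tracks prompt difficulty rather than a fixed iteration budget.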

Problem

Research questions and friction points this paper is trying to address.

text-to-image alignment

complex prompts

training-free methods

adaptive inference

requirement fidelity

Innovation

Methods, ideas, or system contributions that make the work stand out.

training-free

requirement-adaptive

evolutionary refinement