🤖 AI Summary
In text-to-image generation, high-quality outputs often require domain-specific prompt engineering, and existing prompt optimization methods rely heavily on human annotations and biased external aesthetic evaluators. Method: We propose the first annotation-free, external-evaluator-free self-feedback prompt optimization framework. It uses a large vision-language model (LVLM) to serve jointly as a prompt rewriter and as a reward model for image-text alignment and aesthetics, enabling “generate-as-judge” self-supervised reinforcement learning. Contribution/Results: The core innovation is the LVLM’s self-rewarding mechanism and the co-optimization of its two roles. Evaluated on two mainstream benchmarks, the method significantly outperforms strong baselines, improving both aesthetic quality and text-image alignment, which validates the efficacy of self-feedback paradigms in prompt engineering.
📝 Abstract
Text-to-image models can produce high-quality images from given text prompts, but crafting effective prompts often requires specialized vocabulary. To address this, existing methods train rewriting models under supervision from large amounts of manually annotated data and from separately trained aesthetic assessment models. To reduce the dependence on data scale for model training and the biases introduced by such trained models, we propose a novel prompt optimization framework that rephrases a simple user prompt into a sophisticated prompt for a text-to-image model. Specifically, we employ a large vision-language model (LVLM) as the solver that rewrites the user prompt, and concurrently employ the LVLM as a reward model that scores the aesthetics and text alignment of the images generated from the optimized prompt. Instead of relying on laborious human feedback, we exploit the prior knowledge of the LVLM to provide rewards, i.e., AI feedback. The solver and the reward model are unified into one model and iterated through reinforcement learning, achieving self-improvement by producing a solution and judging it itself. Results on two popular datasets demonstrate that our method outperforms other strong competitors.
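The generate-as-judge loop the abstract describes can be sketched in miniature. Everything below is an illustrative stand-in, not the paper's implementation: the keyword table, the scoring rule, and the greedy weight updates are toy substitutes for the LVLM rewriter, the text-to-image model, the LVLM judge, and the reinforcement-learning update, chosen only so the example runs deterministically.

```python
# Toy sketch of a self-feedback "generate-as-judge" loop. In the real
# framework, rewrite_prompt and judge would both query the same LVLM,
# generate_image would call a text-to-image model, and the update would
# be a reinforcement-learning step rather than this greedy hill-climb.

KEYWORDS = ["high detail", "soft lighting", "4k", "blurry"]

def rewrite_prompt(weights, user_prompt):
    """Rewriter role: append the two currently highest-weighted style
    keywords to the user prompt (stand-in for LVLM rewriting)."""
    top = sorted(KEYWORDS, key=lambda k: weights[k], reverse=True)[:2]
    return f"{user_prompt}, {', '.join(top)}"

def generate_image(prompt):
    """Stand-in for the text-to-image model: the prompt string itself
    serves as the 'image' the judge inspects."""
    return prompt

def judge(image, user_prompt):
    """Judge role: the same model scores the result. Toy reward:
    +1 per desirable style keyword (aesthetics) and +1 if the
    original intent is preserved (alignment)."""
    aesthetics = sum(kw in image for kw in ("high detail", "soft lighting", "4k"))
    alignment = int(user_prompt in image)
    return aesthetics + alignment

def self_improve(user_prompt, rounds=3):
    """One model both proposes and judges: keep weight updates whose
    self-assigned reward does not drop, revert the rest."""
    # Start biased toward a bad keyword so the loop has something to fix.
    weights = {kw: 1.0 for kw in KEYWORDS}
    weights["blurry"] = 2.0
    best = judge(generate_image(rewrite_prompt(weights, user_prompt)), user_prompt)
    for _ in range(rounds):
        for kw in KEYWORDS:
            weights[kw] += 1.5  # propose: boost this keyword
            reward = judge(generate_image(rewrite_prompt(weights, user_prompt)),
                           user_prompt)
            if reward >= best:
                best = max(best, reward)  # accept the proposal
            else:
                weights[kw] -= 1.5  # judged worse: revert
    return rewrite_prompt(weights, user_prompt), best
```

Running `self_improve("a cat on a sofa")` steers the rewrite toward the keywords the judge rewards ("high detail", "soft lighting") and away from "blurry", without any human labels or external evaluator, which is the self-feedback idea in a nutshell.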