AI Summary
Existing automatic evaluation metrics for subject-driven text-to-image generation either assess the task's dimensions in isolation, correlate poorly with human judgments, or rely on costly external APIs; none jointly and reliably assesses both textual alignment and subject consistency.
Method: We propose RefVNLI, the first lightweight, single-model metric that jointly evaluates both core dimensions. It constructs a large-scale training dataset by combining video-reasoning data with image perturbations, and employs multi-task contrastive learning, cross-modal feature alignment, and a lightweight ViT-CLIP fusion architecture.
Contribution/Results: RefVNLI achieves state-of-the-art performance across multiple benchmarks and subject categories, improving textual alignment by up to 6.4 points and subject consistency by up to 8.5 points over baselines. It reaches 87.2% agreement with human preferences and is markedly more robust when evaluating rare or lesser-known concepts.
Abstract
Subject-driven text-to-image (T2I) generation aims to produce images that align with a given textual description while preserving the visual identity of a referenced subject image. Despite its broad downstream applicability, ranging from enhanced personalization in image generation to consistent character representation in video rendering, progress in this field is limited by the lack of reliable automatic evaluation. Existing methods either assess only one aspect of the task (i.e., textual alignment or subject preservation), misalign with human judgments, or rely on costly API-based evaluation. To address this, we introduce RefVNLI, a cost-effective metric that evaluates both textual alignment and subject preservation in a single prediction. Trained on a large-scale dataset derived from video-reasoning benchmarks and image perturbations, RefVNLI outperforms or matches existing baselines across multiple benchmarks and subject categories (e.g., *Animal*, *Object*), achieving up to 6.4-point gains in textual alignment and 8.5-point gains in subject consistency. It also excels with lesser-known concepts, aligning with human preferences at over 87% accuracy.
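The single-prediction design can be pictured as one call that returns both scores at once. The sketch below is purely hypothetical: the function name `refvnli_score`, its signature, and the placeholder values are illustrative assumptions, not the paper's actual API or model; it only shows the output contract of a joint metric versus two separate single-aspect metrics.

```python
# Hypothetical sketch of a joint metric's interface (NOT the real RefVNLI code).
# A real implementation would run a single model forward pass over the prompt,
# the reference image, and the generated image; here fixed placeholder values
# stand in for the model's two predicted scores.

def refvnli_score(prompt: str, reference_image: str, generated_image: str) -> dict:
    """Stand-in for one joint prediction of both evaluation dimensions."""
    return {"textual_alignment": 0.92, "subject_preservation": 0.88}

def accept(scores: dict, threshold: float = 0.5) -> bool:
    # An image passes only if BOTH dimensions clear the threshold, reflecting
    # the joint evaluation performed in a single prediction.
    return all(v >= threshold for v in scores.values())

scores = refvnli_score("a corgi surfing a wave", "ref_corgi.png", "gen_001.png")
print(accept(scores))  # True only when both scores pass the threshold
```

The point of the joint contract is that a generated image can fail on either axis independently, so an evaluator must report both scores rather than a single fused number.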