Skill-Aligned Annotation for Reliable Evaluation in Text-to-Image Generation

📅 2026-05-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current evaluation practices for text-to-image generation models typically apply uniform annotation protocols, such as Likert-scale ratings or binary question answering, across evaluation skills that differ fundamentally in nature, which yields unreliable evaluation signals. This work proposes a “skill-aligned annotation” strategy that matches the annotation method to the characteristics of each evaluation skill. In systematic comparisons against uniform baselines, skill-aligned annotation produces more consistent evaluation signals, with higher inter-annotator agreement and improved stability across models. The authors also present an automated pipeline that instantiates the protocol, enabling scalable, fine-grained evaluation with spatially grounded feedback. The result is more reliable and efficient assessment without simply scaling annotation effort.

📝 Abstract

Text-to-image (T2I) generation has advanced rapidly, making reliable evaluation critical as performance differences between models narrow. Existing evaluation practices typically apply uniform annotation mechanisms, such as Likert-scale or binary question answering (BQA), across heterogeneous evaluation skills, despite fundamental differences in their nature. In this work, we revisit T2I evaluation through the lens of skill-aligned annotation, where annotation strategies reflect the underlying characteristics of each evaluation skill. We systematically compare skill-aligned annotation against uniform baselines and show that it produces more consistent evaluation signals, with higher inter-annotator agreement and improved stability across models. Finally, we present an automated pipeline that instantiates the proposed evaluation protocol, enabling scalable and fine-grained evaluation with spatially grounded feedback. Our work highlights that improving the foundations of image evaluation can increase reliability and efficiency without simply scaling annotation effort. We hope this motivates further research on refining evaluation protocols as a central component of reliable model assessment.
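To make "inter-annotator agreement" concrete, here is a minimal sketch of one common chance-corrected agreement statistic, Cohen's kappa, for two annotators. This is an illustrative assumption: the abstract does not say which agreement metric the authors use, and the function name `cohen_kappa` and the toy binary answers below are hypothetical, not data from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Illustrative only: the paper's actual agreement metric is not
    stated in the abstract.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used the same single label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Toy binary (yes=1 / no=0) answers from two hypothetical annotators.
annotator_1 = [1, 1, 0, 1, 0, 0, 1, 1]
annotator_2 = [1, 1, 0, 0, 0, 0, 1, 1]
print(cohen_kappa(annotator_1, annotator_2))  # → 0.75
```

The same function accepts Likert-scale labels, since it only compares label identity. For ordinal scales a weighted variant that gives partial credit to near-misses is often preferred, but that is beyond this sketch.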

Problem

Research questions and friction points this paper is trying to address.

text-to-image generation

evaluation reliability

annotation alignment

inter-annotator agreement

evaluation protocols

Innovation

Methods, ideas, or system contributions that make the work stand out.

skill-aligned annotation

text-to-image evaluation

inter-annotator agreement