Towards Scalable Human-aligned Benchmark for Text-guided Image Editing

📅 2025-05-01

🤖 AI Summary
Problem: Text-guided image editing lacks an objective, scalable evaluation method; existing approaches rely on time-consuming, labor-intensive human assessment. Method: We introduce HATIE, the first large-scale, human-perception-aligned automated benchmark for text-guided image editing, comprising a high-quality, multi-task benchmark dataset and an end-to-end evaluation pipeline. Contribution/Results: HATIE features (1) a multidimensional composite metric integrating fidelity, edit accuracy, consistency, and naturalness; (2) a learnable perceptual weighting scheme for aggregating sub-scores; and (3) extensive human calibration and statistical validation showing strong alignment with human preferences (Spearman ρ > 0.87). The benchmark enables reproducible, quantitative comparison of state-of-the-art models, reducing the dependence on subjective assessment.

📝 Abstract
A variety of text-guided image editing models have been proposed recently. However, there is no widely accepted standard evaluation method, mainly because of the subjective nature of the task, which leaves researchers to rely on manual user studies. To address this, we introduce a novel Human-Aligned benchmark for Text-guided Image Editing (HATIE). By providing a large-scale benchmark set covering a wide range of editing tasks, it allows reliable evaluation that is not limited to specific easy-to-evaluate cases. HATIE also provides a fully automated and omnidirectional evaluation pipeline. In particular, we combine multiple scores measuring various aspects of editing so that the result aligns with human perception. We empirically verify that the evaluation of HATIE is indeed human-aligned in various aspects, and report benchmark results on several state-of-the-art models to offer deeper insight into their performance.

Problem

Research questions and friction points this paper is trying to address.

No widely accepted standard evaluation method for text-guided image editing
Need for a human-aligned benchmark to assess editing quality
Need for an automated pipeline that combines multiple scores for reliable evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces HATIE, a human-aligned benchmark for text-guided image editing
Provides a large-scale benchmark set covering a wide range of editing tasks
Combines multiple scores, each measuring a different aspect of editing, to align with human perception
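
The idea of combining several sub-scores and then checking alignment with human judgment can be illustrated with a minimal Python sketch. This is not HATIE's actual pipeline: the function names, the weights, the sub-score values, and the human ratings below are all hypothetical placeholders, and a plain weighted average stands in for whatever aggregation the paper uses. Only the four aspect names (fidelity, edit accuracy, consistency, naturalness) and the use of Spearman rank correlation come from the summary above.

```python
from statistics import mean


def composite(sub_scores, weights):
    """Weighted average of per-aspect scores (hypothetical aggregation)."""
    total = sum(weights.values())
    return sum(weights[k] * sub_scores[k] for k in weights) / total


def rank(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rho: Pearson correlation computed on the ranks."""
    return pearson(rank(x), rank(y))


# Made-up weights and sub-scores for four edited images (illustration only).
weights = {"fidelity": 0.3, "edit_accuracy": 0.4,
           "consistency": 0.2, "naturalness": 0.1}
edits = [
    {"fidelity": 0.9, "edit_accuracy": 0.8, "consistency": 0.7, "naturalness": 0.9},
    {"fidelity": 0.4, "edit_accuracy": 0.3, "consistency": 0.5, "naturalness": 0.6},
    {"fidelity": 0.7, "edit_accuracy": 0.9, "consistency": 0.8, "naturalness": 0.7},
    {"fidelity": 0.2, "edit_accuracy": 0.5, "consistency": 0.3, "naturalness": 0.4},
]
human_ratings = [4.5, 2.0, 4.0, 1.5]  # made-up mean opinion scores

auto_scores = [composite(e, weights) for e in edits]
print("Spearman rho:", spearman(auto_scores, human_ratings))
```

A rank correlation is a natural choice for this kind of check because it only asks whether the automated score orders the edits the way people do, not whether the two scales match numerically.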