🤖 AI Summary
This work addresses the limitations of prevailing evaluation methods for vision generative models, which rely predominantly on pointwise scoring and suffer from stochastic inconsistency and poor alignment with human perception. To overcome these issues, the authors propose GenArena, a unified automated evaluation framework built on pairwise comparisons. GenArena systematically exposes the deficiencies of pointwise scoring and leverages vision-language models as proxy judges to enable efficient and stable model comparisons. The approach improves evaluation accuracy by over 20% and achieves a Spearman correlation of 0.86 with human assessments on the authoritative LMArena leaderboard, far surpassing the 0.36 correlation of pointwise methods. The framework also shows that, under this more reliable evaluation paradigm, open-source models can surpass leading closed-source counterparts.
📝 Abstract
The rapid advancement of visual generation models has outpaced traditional evaluation approaches, necessitating the adoption of Vision-Language Models as surrogate judges. In this work, we systematically investigate the reliability of the prevailing absolute pointwise scoring standard across a wide spectrum of visual generation tasks. Our analysis reveals that this paradigm suffers from stochastic inconsistency and poor alignment with human perception. To resolve these limitations, we introduce GenArena, a unified evaluation framework that leverages a pairwise comparison paradigm to ensure stable and human-aligned evaluation. Crucially, our experiments uncover a striking finding: simply adopting this pairwise protocol enables off-the-shelf open-source models to outperform top-tier proprietary models. Notably, our method boosts evaluation accuracy by over 20% and achieves a Spearman correlation of 0.86 with the authoritative LMArena leaderboard, drastically surpassing the 0.36 correlation of pointwise methods. Based on GenArena, we benchmark state-of-the-art visual generation models across diverse tasks, providing the community with a rigorous and automated evaluation standard for visual generation.
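To make the pairwise paradigm concrete, the sketch below shows one common way such verdicts become a leaderboard: aggregate (winner, loser) judgments with Elo-style updates, then measure rank agreement with a reference (e.g., human) ordering via Spearman's rho. This is a minimal illustration, not the authors' implementation; the model names, match outcomes, and reference ranking are hypothetical placeholders.

```python
# Hypothetical sketch: turning pairwise "which output is better?" verdicts
# from a VLM judge into a leaderboard, then checking rank agreement with a
# reference ranking. Not the GenArena codebase; all data is illustrative.

def elo_update(r_a, r_b, a_wins, k=32.0):
    """Apply one standard Elo update for a single pairwise comparison."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

def spearman_rho(order_a, order_b):
    """Spearman correlation between two tie-free rankings of the same items."""
    n = len(order_a)
    pos_b = {model: i for i, model in enumerate(order_b)}
    d_sq = sum((i - pos_b[m]) ** 2 for i, m in enumerate(order_a))
    return 1.0 - 6.0 * d_sq / (n * (n * n - 1))

# Illustrative pairwise verdicts a judge model might emit: (winner, loser).
matches = [("A", "B"), ("A", "C"), ("B", "C")] * 3
ratings = {m: 1000.0 for m in ("A", "B", "C")}
for winner, loser in matches:
    ratings[winner], ratings[loser] = elo_update(
        ratings[winner], ratings[loser], a_wins=True
    )

# Leaderboard: models sorted by final rating, best first.
leaderboard = sorted(ratings, key=ratings.get, reverse=True)
# Agreement with a hypothetical human reference ranking.
rho = spearman_rho(leaderboard, ["A", "B", "C"])
```

Because each comparison only asks the judge for a relative preference, per-sample score noise cancels out in aggregation, which is the intuition behind the stability gains the abstract reports.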