K-Sort Arena: Efficient and Reliable Benchmarking for Generative Models via K-wise Human Preferences

📅 2024-08-26

🏛️ arXiv.org

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

Traditional pairwise Arena-style preference comparisons converge slowly and are sensitive to preference noise in votes, which hinders scalable evaluation of generative models. To address this, we propose a *K*-wise human preference comparison paradigm in which *K* models compete in a free-for-all within a single vote, yielding more information per interaction than a pairwise comparison. Probabilistic modeling and Bayesian updating make the ranking more robust to noisy or inconsistent judgments, and an exploration-exploitation-based matchmaking strategy selects more informative comparisons. The resulting leaderboard covers text-to-image and text-to-video models. In our experiments, the method converges 16.3× faster than the widely used ELO algorithm. The platform has undergone several months of internal testing and is publicly available on Hugging Face Spaces.

📝 Abstract

The rapid advancement of visual generative models necessitates efficient and reliable evaluation methods. Arena platforms, which gather user votes on model comparisons, can rank models according to human preferences. However, traditional Arena methods, while established, require an excessive number of comparisons for the ranking to converge and are vulnerable to preference noise in voting, suggesting the need for better approaches tailored to contemporary evaluation challenges. In this paper, we introduce K-Sort Arena, an efficient and reliable platform based on a key insight: images and videos possess higher perceptual intuitiveness than text, enabling rapid evaluation of multiple samples simultaneously. Consequently, K-Sort Arena employs K-wise comparisons, allowing K models to engage in free-for-all competitions, which yield much richer information than pairwise comparisons. To enhance the robustness of the system, we leverage probabilistic modeling and Bayesian updating techniques. We also propose an exploration-exploitation-based matchmaking strategy to facilitate more informative comparisons. In our experiments, K-Sort Arena exhibits 16.3x faster convergence than the widely used ELO algorithm. To further validate this advantage and obtain a comprehensive leaderboard, we collect human feedback via crowdsourced evaluations of numerous cutting-edge text-to-image and text-to-video models. Thanks to its high efficiency, K-Sort Arena can continuously incorporate emerging models and update the leaderboard with minimal votes. Our project has undergone several months of internal testing and is now available at https://huggingface.co/spaces/ksort/K-Sort-Arena
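To make the "richer information" claim concrete, the sketch below expands one K-wise vote into the pairwise outcomes it implies and feeds them to a standard Elo update. This is an illustration of the ELO baseline mentioned in the abstract, not K-Sort Arena's own update rule; the model names, the `k_factor` of 32, and the starting rating of 1000 are assumptions chosen for the example.

```python
from itertools import combinations


def pairwise_outcomes(ranking):
    """Expand one K-wise ranking (best first) into (winner, loser) pairs."""
    # combinations() preserves input order, so the first item of each pair
    # is always the one ranked higher in the vote.
    return list(combinations(ranking, 2))


def elo_update(ratings, winner, loser, k_factor=32.0):
    """Apply one standard Elo update for a single pairwise result."""
    expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
    delta = k_factor * (1.0 - expected_win)
    ratings[winner] += delta
    ratings[loser] -= delta


# Hypothetical models and a hypothetical K=4 vote, listed best first.
ratings = {name: 1000.0 for name in ["model_a", "model_b", "model_c", "model_d"]}
vote = ["model_c", "model_a", "model_d", "model_b"]

pairs = pairwise_outcomes(vote)
for winner, loser in pairs:
    elo_update(ratings, winner, loser)

print(len(pairs))  # 6 pairwise outcomes from a single K=4 vote
```

A single vote over K models implies K(K-1)/2 pairwise results, so a K=4 vote carries as many outcomes as six separate pairwise votes.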

Problem

Research questions and friction points this paper is trying to address.

Efficient evaluation of generative models using human preferences.

Reducing the number of comparisons needed for the ranking to converge, and its sensitivity to noisy votes.

Enhancing robustness with probabilistic modeling and Bayesian techniques.
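As a rough illustration of how Bayesian updating can dampen noisy votes, the sketch below keeps a Gaussian belief over one model's latent skill and applies a conjugate normal-normal update per observation. The `SkillBelief` class, the numeric observations, and the `noise_variance` value are all hypothetical; this summary does not specify the paper's actual likelihood model, so treat this as a generic textbook update rather than the authors' method.

```python
from dataclasses import dataclass


@dataclass
class SkillBelief:
    mean: float      # current estimate of the model's latent skill
    variance: float  # uncertainty about that estimate


def bayes_update(prior, observation, noise_variance):
    """Conjugate normal-normal update for one noisy observation of skill."""
    # The gain shrinks as the belief becomes more certain or the votes noisier.
    gain = prior.variance / (prior.variance + noise_variance)
    posterior_mean = prior.mean + gain * (observation - prior.mean)
    posterior_variance = (1.0 - gain) * prior.variance
    return SkillBelief(posterior_mean, posterior_variance)


belief = SkillBelief(mean=0.0, variance=1.0)
# Hypothetical observations; the last one disagrees with the others.
for observed in [1.0, 1.0, 1.0, -1.0]:
    belief = bayes_update(belief, observed, noise_variance=4.0)

print(belief)
```

Because the gain falls with every update, a late inconsistent vote moves the estimate less than an early one, and a larger `noise_variance` makes every vote count for less.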

Innovation

Methods, ideas, or system contributions that make the work stand out.

K-wise comparisons for richer model evaluations.

Probabilistic modeling enhances system robustness.

Exploration-exploitation strategy for informative matchmaking.
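One common way to trade off exploration and exploitation is an upper-confidence-bound (UCB) score, sketched below for choosing which K models enter the next comparison. The scoring formula, the `exploration` weight, and the model names are assumptions made for illustration; the summary above does not state the criterion K-Sort Arena actually uses, and "exploitation" is simplified here to favoring highly rated models.

```python
import math


def select_players(means, counts, k, exploration=1.0):
    """Pick k models with the highest UCB score (estimate + uncertainty bonus)."""
    total = sum(counts.values()) + 1

    def ucb(name):
        # Rarely compared models get a larger bonus, which drives exploration.
        bonus = exploration * math.sqrt(math.log(total + 1) / (counts[name] + 1))
        return means[name] + bonus

    return sorted(means, key=ucb, reverse=True)[:k]


# Hypothetical skill estimates and comparison counts; "new" has no votes yet.
means = {"a": 1.2, "b": 1.0, "c": 0.2, "new": 0.0}
counts = {"a": 50, "b": 50, "c": 50, "new": 0}

print(select_players(means, counts, k=2))                   # includes "new"
print(select_players(means, counts, k=2, exploration=0.0))  # pure exploitation
```

With the bonus enabled, a newly added model is pulled into comparisons right away, which matches the stated goal of incorporating emerging models with few votes; setting `exploration=0.0` falls back to always comparing the current top-rated models.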

🔎 Similar Papers

Ranking Generated Answers: On the Agreement of Retrieval Models with Humans on Consumer Health Questions