K-Sort Arena: Efficient and Reliable Benchmarking for Generative Models via K-wise Human Preferences

📅 2024-08-26
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
Traditional pairwise arena-style preference comparisons suffer from slow convergence and high sensitivity to annotation noise, hindering scalable evaluation of large generative models. To address this, we propose a novel *K*-wise human preference comparison paradigm that lets *K* models compete in a single free-for-all interaction, substantially improving information efficiency over pairwise voting. We further design an exploration-exploitation-driven matchmaking strategy that integrates probabilistic modeling with Bayesian dynamic updating to enhance robustness against noisy or inconsistent judgments. On top of this rating scheme, we build a real-time multimodal leaderboard supporting text-to-image and text-to-video generation. Experiments demonstrate that our method achieves 16.3x faster convergence than standard ELO-based ranking. The framework is open-sourced and has been deployed in production for large-scale model evaluation.

📝 Abstract
The rapid advancement of visual generative models necessitates efficient and reliable evaluation methods. Arena platforms, which gather user votes on model comparisons, can rank models according to human preferences. However, traditional Arena methods, while established, require an excessive number of comparisons for ranking to converge and are vulnerable to preference noise in voting, suggesting the need for better approaches tailored to contemporary evaluation challenges. In this paper, we introduce K-Sort Arena, an efficient and reliable platform based on a key insight: images and videos possess higher perceptual intuitiveness than texts, enabling rapid evaluation of multiple samples simultaneously. Consequently, K-Sort Arena employs K-wise comparisons, allowing K models to engage in free-for-all competitions, which yield much richer information than pairwise comparisons. To enhance the robustness of the system, we leverage probabilistic modeling and Bayesian updating techniques. We propose an exploration-exploitation-based matchmaking strategy to facilitate more informative comparisons. In our experiments, K-Sort Arena exhibits 16.3x faster convergence compared to the widely used ELO algorithm. To further validate its superiority and obtain a comprehensive leaderboard, we collect human feedback via crowdsourced evaluations of numerous cutting-edge text-to-image and text-to-video models. Thanks to its high efficiency, K-Sort Arena can continuously incorporate emerging models and update the leaderboard with minimal votes. Our project has undergone several months of internal testing and is now available at https://huggingface.co/spaces/ksort/K-Sort-Arena
Problem

Research questions and friction points this paper is trying to address.

Efficiently evaluating generative models with human preferences.
Reducing the number of comparisons needed for rankings to converge and mitigating preference noise in voting.
Enhancing ranking robustness through probabilistic modeling and Bayesian updating.
Innovation

Methods, ideas, or system contributions that make the work stand out.

K-wise comparisons that yield richer evaluation signal than pairwise voting
Probabilistic modeling and Bayesian updating for robustness to preference noise
Exploration-exploitation matchmaking that selects the most informative comparisons (see the sketch after this list)
👥 Authors

Zhikai Li (Institute of Automation, Chinese Academy of Sciences)
Xuewen Liu (Institute of Automation, Chinese Academy of Sciences; research interests: model compression)
Dongrong Fu (University of California, Berkeley)
Jianquan Li (Institute of Automation, Chinese Academy of Sciences)
Qingyi Gu (Institute of Automation, Chinese Academy of Sciences; research interests: high-speed vision, cell analysis)
Kurt Keutzer (Professor of the Graduate School, EECS, University of California, Berkeley; research interests: artificial intelligence systems, deep learning, efficient computation)
Zhen Dong (University of California, Berkeley)