Re-evaluating Open-ended Evaluation of Large Language Models

📅 2025-02-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing Elo-based open evaluation methods are vulnerable to prompt redundancy, amplifying inherent biases in data and compromising the stability and fairness of model rankings. This paper reformulates large language model (LLM) open evaluation as a three-player game involving two competing models (A and B) and a shared prompt (P), departing from conventional pairwise comparison paradigms. We introduce the *redundancy-robust game solution*, a theoretically grounded concept that eliminates spurious score inflation caused by duplicate or semantically overlapping prompts. Leveraging this principle, we redesign the Elo update rule to yield a redundancy-resilient open evaluation framework. Experiments demonstrate that our approach significantly improves score stability—reducing variance by 32%—produces model rankings better aligned with human preferences, and uncovers previously overlooked prompt-dependent biases in current mainstream LLM competitions.
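The vulnerability the summary describes can be illustrated with a standard pairwise Elo update (this is a generic sketch of the baseline rating rule the paper critiques, not the paper's redundancy-robust method; the function name, starting ratings, and K-factor are illustrative assumptions):

```python
# Illustrative sketch: a vanilla pairwise Elo update, showing how replaying
# duplicate or near-duplicate prompts keeps shifting ratings even though
# no new information about the models is gained.
def elo_update(r_a, r_b, score_a, k=32.0):
    """One Elo step; score_a is 1.0 if A wins, 0.5 for a draw, 0.0 if A loses."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two models start equal; model A wins on one prompt.
r_a, r_b = 1000.0, 1000.0
r_a, r_b = elo_update(r_a, r_b, 1.0)
after_one_win = r_a

# Replaying nine redundant copies of the same outcome inflates A's rating
# further, which is the score inflation the 3-player reformulation targets.
for _ in range(9):
    r_a, r_b = elo_update(r_a, r_b, 1.0)

print(after_one_win, r_a)
```

Under this baseline rule, ten redundant wins move the rating several times further than a single win, which is why duplicated prompts can distort the final ranking.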

📝 Abstract
Evaluation has traditionally focused on ranking candidates for a specific skill. Modern generalist models, such as Large Language Models (LLMs), decidedly outpace this paradigm. Open-ended evaluation systems, where candidate models are compared on user-submitted prompts, have emerged as a popular solution. Despite their many advantages, we show that the current Elo-based rating systems can be susceptible to and even reinforce biases in data, intentional or accidental, due to their sensitivity to redundancies. To address this issue, we propose evaluation as a 3-player game, and introduce novel game-theoretic solution concepts to ensure robustness to redundancy. We show that our method leads to intuitive ratings and provide insights into the competitive landscape of LLM development.
Problem

Research questions and friction points this paper is trying to address.

How to evaluate generalist LLMs in open-ended settings beyond fixed-skill rankings
Susceptibility of Elo-based rating systems to redundancy-driven data biases
Need for rating rules that stay stable when prompts are duplicated or semantically overlapping
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reformulation of open-ended evaluation as a 3-player game (two models and a prompt)
Game-theoretic solution concepts robust to prompt redundancy
Redundancy-resilient Elo-style rating update