Re-evaluating Open-ended Evaluation of Large Language Models

📅 2025-02-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing Elo-based open evaluation methods are vulnerable to prompt redundancy, amplifying inherent biases in data and compromising the stability and fairness of model rankings. This paper reformulates large language model (LLM) open evaluation as a three-player game involving two competing models (A and B) and a shared prompt (P), departing from conventional pairwise comparison paradigms. We introduce the *redundancy-robust game solution*, a theoretically grounded concept that eliminates spurious score inflation caused by duplicate or semantically overlapping prompts. Leveraging this principle, we redesign the Elo update rule to yield a redundancy-resilient open evaluation framework. Experiments demonstrate that our approach significantly improves score stability—reducing variance by 32%—produces model rankings better aligned with human preferences, and uncovers previously overlooked prompt-dependent biases in current mainstream LLM competitions.
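The vulnerability the summary describes can be illustrated with a standard pairwise Elo update (this is a generic sketch of the baseline rating rule the paper critiques, not the paper's redundancy-robust method; the function name, starting ratings, and K-factor are illustrative assumptions):

```python
# Illustrative sketch: a vanilla pairwise Elo update, showing how replaying
# duplicate or near-duplicate prompts keeps shifting ratings even though
# no new information about the models is gained.
def elo_update(r_a, r_b, score_a, k=32.0):
    """One Elo step; score_a is 1.0 if A wins, 0.5 for a draw, 0.0 if A loses."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two models start equal; model A wins on one prompt.
r_a, r_b = 1000.0, 1000.0
r_a, r_b = elo_update(r_a, r_b, 1.0)
after_one_win = r_a

# Replaying nine redundant copies of the same outcome inflates A's rating
# further, which is the score inflation the 3-player reformulation targets.
for _ in range(9):
    r_a, r_b = elo_update(r_a, r_b, 1.0)

print(after_one_win, r_a)
```

Under this baseline rule, ten redundant wins move the rating several times further than a single win, which is why duplicated prompts can distort the final ranking.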

📝 Abstract
Evaluation has traditionally focused on ranking candidates for a specific skill. Modern generalist models, such as Large Language Models (LLMs), decidedly outpace this paradigm. Open-ended evaluation systems, where candidate models are compared on user-submitted prompts, have emerged as a popular solution. Despite their many advantages, we show that the current Elo-based rating systems can be susceptible to and even reinforce biases in data, intentional or accidental, due to their sensitivity to redundancies. To address this issue, we propose evaluation as a 3-player game, and introduce novel game-theoretic solution concepts to ensure robustness to redundancy. We show that our method leads to intuitive ratings and provide insights into the competitive landscape of LLM development.
Problem

Research questions and friction points this paper is trying to address.

How to evaluate generalist LLMs in open-ended settings beyond fixed-skill rankings
Susceptibility of Elo-based rating systems to redundancy-driven data biases
Need for rating rules that stay stable when prompts are duplicated or semantically overlapping
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reformulation of open-ended evaluation as a 3-player game (two models and a prompt)
Game-theoretic solution concepts robust to prompt redundancy
Redundancy-resilient Elo-style rating update