The Leaderboard Illusion

📅 2025-04-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper identifies systematic biases and ranking distortions in the Chatbot Arena leaderboard arising from private testing, selective score disclosure, and unequal data access. Methodologically, it conducts the first quantitative analysis of public and private test logs, sampling-rate disparities, and data-allocation patterns, employing statistical modeling and distributional attribution techniques. Results reveal that Meta privately evaluated 27 Llama variants ahead of the Llama-4 release; that Google and OpenAI collectively received an estimated 39.6% of Arena data (19.2% and 20.4%, respectively), while 83 open-weight models shared only an estimated 29.7%; and that even limited additional data access yields up to a 112% relative win-rate improvement on the Arena distribution. The findings demonstrate that current rankings partly reflect overfitting to Arena-specific evaluation dynamics rather than genuine capability advances. The work provides empirical evidence and actionable recommendations for more transparent and fair AI evaluation.

📝 Abstract
Measuring progress is fundamental to the advancement of any scientific field. As benchmarks play an increasingly central role, they also grow more susceptible to distortion. Chatbot Arena has emerged as the go-to leaderboard for ranking the most capable AI systems. Yet, in this work we identify systematic issues that have resulted in a distorted playing field. We find that undisclosed private testing practices benefit a handful of providers who are able to test multiple variants before public release and retract scores if desired. We establish that the ability of these providers to choose the best score leads to biased Arena scores due to selective disclosure of performance results. At an extreme, we identify 27 private LLM variants tested by Meta in the lead-up to the Llama-4 release. We also establish that proprietary closed models are sampled at higher rates (number of battles) and have fewer models removed from the arena than open-weight and open-source alternatives. Both these policies lead to large data access asymmetries over time. Providers like Google and OpenAI have received an estimated 19.2% and 20.4% of all data on the arena, respectively. In contrast, a combined 83 open-weight models have only received an estimated 29.7% of the total data. We show that access to Chatbot Arena data yields substantial benefits; even limited additional data can result in relative performance gains of up to 112% on the arena distribution, based on our conservative estimates. Together, these dynamics result in overfitting to Arena-specific dynamics rather than general model quality. The Arena builds on the substantial efforts of both the organizers and an open community that maintains this valuable evaluation platform. We offer actionable recommendations to reform the Chatbot Arena's evaluation framework and promote fairer, more transparent benchmarking for the field.
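The selective-disclosure effect the abstract describes can be illustrated with a toy Monte Carlo simulation: if a provider privately tests many variants of identical true skill and publishes only the best-scoring one, the reported win rate is biased upward purely by sampling noise. The sketch below is illustrative only and is not the paper's methodology; the variant count (27), battle count, and the assumption that measured win rates are simple binomial samples are all assumptions made here for demonstration.

```python
import random
import statistics

def measured_win_rate(true_p: float, battles: int, rng: random.Random) -> float:
    """Noisy estimate of a model's win rate from a finite sample of Arena battles."""
    wins = sum(rng.random() < true_p for _ in range(battles))
    return wins / battles

def best_of_n(true_p: float, n_variants: int, battles: int, rng: random.Random) -> float:
    """Score published when a provider tests n private variants and keeps the best."""
    return max(measured_win_rate(true_p, battles, rng) for _ in range(n_variants))

rng = random.Random(0)
TRUE_P = 0.5    # every variant has identical true skill (assumption)
BATTLES = 200   # battles observed per private variant (assumption)
TRIALS = 2000

# Average published score with honest single submission vs. best-of-27 selection.
honest = statistics.mean(best_of_n(TRUE_P, 1, BATTLES, rng) for _ in range(TRIALS))
selected = statistics.mean(best_of_n(TRUE_P, 27, BATTLES, rng) for _ in range(TRIALS))

print(f"single submission:   {honest:.3f}")
print(f"best of 27 variants: {selected:.3f}")
```

Even though every variant is equally capable, the best-of-27 published score comes out several percentage points above the true 0.5 win rate, which is the core mechanism behind the biased Arena scores the paper documents.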
Problem

Research questions and friction points this paper is trying to address.

Undisclosed private testing biases AI leaderboard rankings
Selective disclosure of performance results distorts model comparisons
Data access asymmetries favor proprietary over open-source models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Exposed private testing biases in leaderboard rankings
Identified selective disclosure of performance results
Proposed reforms for fairer benchmarking practices