🤖 AI Summary
This work examines how fragile large language model leaderboards such as Chatbot Arena are under structured data perturbations, showing that small targeted manipulations can substantially alter rankings. We present a unified framework that models match-level perturbations (Drop, Add, and Flip) together with player removal, and we use influence-function approximations to analyze the stability of the Bradley–Terry model. The analysis quantifies effects on top-k membership, Kendall’s tau rank correlation, and confidence intervals. We summarize these effects with normalized dataset-level robustness scores and reuse the same influence scores for efficient targeted manipulation and active sampling. Experiments on seven real-world pairwise-comparison datasets, including Chatbot Arena, show that targeted perturbations of fewer than 1% of comparisons can change the top-ranked model, degrade Kendall’s tau, and alter confidence intervals. The influence-based approach also promotes or demotes specific models and reduces target-model uncertainty with fewer actions than previous manipulation and active-sampling baselines.
📝 Abstract
Evaluation leaderboards such as LMArena play a central role in benchmarking large language models by aggregating pairwise human preferences into model rankings, yet the robustness of these rankings remains poorly understood. We present a unified perturbation framework for analyzing Bradley-Terry leaderboards under structured data modifications using influence-based approximations. Our framework studies three match-level perturbations (Drop, Add, and Flip) together with player removal, and evaluates their effects on three objectives: top-k membership, global ranking consistency via Kendall's tau, and confidence-interval-based uncertainty. Across Chatbot Arena and six additional pairwise-comparison datasets, we show that modern leaderboards are non-robust on all three objectives: targeted perturbations of fewer than 1% of comparisons can change the top-ranked model, degrade Kendall's tau, and alter confidence intervals. Beyond robustness auditing, we show that the same influence scores enable efficient targeted perturbations, promoting or demoting specific models and reducing target-model uncertainty with fewer actions than previous manipulation and active-sampling baselines. By summarizing these effects with normalized dataset-level robustness scores, our framework provides a practical tool for auditing leaderboard stability and motivating more robust evaluation protocols.
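To make the Drop perturbation concrete, here is a minimal sketch of an influence-style approximation for a Bradley-Terry fit. It is not the paper's implementation. The toy data, the small ridge term (added only to pin down the otherwise free additive offset of the scores), and all function names are assumptions made for illustration. The idea is to fit the scores once, then estimate the scores after dropping one match with a single Newton step instead of a full refit.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def grad_hess(theta, matches, ridge):
    """Gradient and Hessian of the ridge-regularized negative log-likelihood.
    Each match is a (winner, loser) pair of player indices."""
    n = len(theta)
    g = [ridge * t for t in theta]
    H = [[ridge if i == j else 0.0 for j in range(n)] for i in range(n)]
    for w, l in matches:
        p = sigmoid(theta[w] - theta[l])  # model probability that w beats l
        g[w] -= 1 - p
        g[l] += 1 - p
        v = p * (1 - p)
        H[w][w] += v
        H[l][l] += v
        H[w][l] -= v
        H[l][w] -= v
    return g, H

def fit(matches, n, ridge=1e-2, iters=30):
    """Fit Bradley-Terry scores with plain Newton iterations (no line search)."""
    theta = [0.0] * n
    for _ in range(iters):
        g, H = grad_hess(theta, matches, ridge)
        step = solve(H, g)
        theta = [t - s for t, s in zip(theta, step)]
    return theta

def drop_one_approx(theta, matches, idx, ridge=1e-2):
    """Approximate the refit scores after dropping matches[idx]:
    theta_new ~= theta + H^{-1} * (gradient of the dropped match's loss)."""
    _, H = grad_hess(theta, matches, ridge)
    w, l = matches[idx]
    p = sigmoid(theta[w] - theta[l])
    g_i = [0.0] * len(theta)
    g_i[w] = -(1 - p)
    g_i[l] = 1 - p
    delta = solve(H, g_i)
    return [t + d for t, d in zip(theta, delta)]

# Hypothetical toy data: 4 players; in every pair the lower index wins 7 of 10.
matches = []
for i in range(4):
    for j in range(i + 1, 4):
        matches += [(i, j)] * 7 + [(j, i)] * 3

theta = fit(matches, 4)
idx = matches.index((3, 0))  # an upset: player 3 beat player 0
approx = drop_one_approx(theta, matches, idx)
exact = fit(matches[:idx] + matches[idx + 1:], 4)  # full refit for comparison
err = max(abs(a - e) for a, e in zip(approx, exact))
print("max abs error of the approximation:", err)
```

Because the approximation needs only one linear solve per candidate match, it can be used to score every match by its estimated effect on a quantity of interest, such as the gap between the top two models, without refitting the model each time. Flip and Add perturbations could be handled the same way by changing the sign or content of the per-match gradient, though the exact formulation used in the paper may differ.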