Dropping Just a Handful of Preferences Can Change Top Large Language Model Rankings

📅 2025-08-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work studies the robustness of a widely used LLM ranking system, the Bradley–Terry model, to the removal of a worst-case small fraction of preference data. Focusing on Chatbot Arena and MT-Bench, the authors propose a computationally efficient worst-case data-removal method that identifies the most influential preference samples: dropping as little as 0.02% of the evaluations can flip the ranking of the top models, exposing how sensitive current rankings are to a small set of discriminative comparisons. Rankings based on crowdsourced human preferences prove just as fragile as those based on LLM-as-a-judge evaluations, while MT-Bench rankings are notably more robust, likely owing to expert annotators and carefully constructed prompts. The study quantifies the sensitivity of preference-based rankings to data removal and provides a fast, scalable diagnostic framework, offering practical tools for building more reliable and trustworthy LLM evaluation infrastructure.
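As a concrete reference point, the sketch below shows (in plain NumPy, not the authors' code) how Bradley–Terry scores are typically fit to pairwise preference data: each comparison is treated as a logistic observation on the score difference between winner and loser, and the maximum-likelihood scores induce the ranking. The toy data, model count, and optimization settings are illustrative assumptions.

```python
import numpy as np

def fit_bradley_terry(wins, n_models, n_iters=2000, lr=0.1):
    """Fit Bradley-Terry strengths by gradient ascent on the log-likelihood.

    wins: integer array of shape (n_comparisons, 2) holding (winner, loser) indices.
    Returns a zero-mean score vector; higher scores mean stronger models.
    """
    beta = np.zeros(n_models)
    winners, losers = wins[:, 0], wins[:, 1]
    for _ in range(n_iters):
        # P(winner beats loser) under the current scores
        p = 1.0 / (1.0 + np.exp(-(beta[winners] - beta[losers])))
        grad = np.zeros(n_models)
        # each comparison pushes the winner up and the loser down by (1 - p)
        np.add.at(grad, winners, 1.0 - p)
        np.add.at(grad, losers, -(1.0 - p))
        beta += lr * grad / len(wins)
    return beta - beta.mean()  # fix the mean at zero for identifiability

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # toy "arena": model 0 slightly stronger than model 1, both stronger than model 2
    true_scores = np.array([0.30, 0.20, -0.50])
    pairs = rng.integers(0, 3, size=(5000, 2))
    pairs = pairs[pairs[:, 0] != pairs[:, 1]]
    p_first_wins = 1.0 / (1.0 + np.exp(-(true_scores[pairs[:, 0]] - true_scores[pairs[:, 1]])))
    second_won = rng.random(len(pairs)) > p_first_wins
    wins = pairs.copy()
    wins[second_won] = wins[second_won][:, ::-1]  # swap so column 0 is always the winner
    beta = fit_bradley_terry(wins, n_models=3)
    print("estimated scores:", np.round(beta, 3))
    print("ranking (best first):", np.argsort(-beta))
```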

📝 Abstract
We propose a method for evaluating the robustness of a widely used LLM ranking system (the Bradley–Terry ranking system) to dropping a worst-case very small fraction of evaluation data. Our approach is computationally fast and easy to adopt. When we apply our method to matchups from two popular human-preference platforms, Chatbot Arena and MT-Bench, we find that the Bradley–Terry rankings of top-performing models are remarkably sensitive to the removal of a small fraction of evaluations. Our framework also identifies the specific evaluations most responsible for such ranking flips, allowing for inspections of these influential preferences. We observe that the rankings derived from MT-Bench preferences are notably more robust than those from Chatbot Arena, likely due to MT-Bench's use of expert annotators and carefully constructed prompts. Finally, we find that rankings based on crowdsourced human-evaluated systems are just as sensitive as those based on LLM-as-a-judge evaluations, where in both, dropping as little as 0.02% of the total evaluations in the dataset can change the top-ranked model.
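To make the removal experiment concrete, here is a hedged sketch of a much simpler greedy heuristic than the paper's efficient worst-case method: repeatedly drop a preference credited to the current top model, refit the Bradley–Terry scores, and count how many removals are needed before the top-ranked model changes. All data, helper names, and budgets here are illustrative assumptions.

```python
import numpy as np

def fit_bt(wins, n_models, n_iters=1500, lr=0.1):
    """Compact Bradley-Terry fit (same gradient-ascent sketch as above)."""
    beta = np.zeros(n_models)
    w, l = wins[:, 0], wins[:, 1]
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-(beta[w] - beta[l])))
        grad = np.zeros(n_models)
        np.add.at(grad, w, 1.0 - p)
        np.add.at(grad, l, -(1.0 - p))
        beta += lr * grad / len(wins)
    return beta - beta.mean()

def removals_to_flip(wins, n_models, max_drop=200):
    """Greedily drop wins credited to the current leader until the top rank flips.

    Returns (number of removals needed, new leader), or (None, leader) if the
    removal budget is exhausted without a flip.
    """
    wins = wins.copy()
    beta = fit_bt(wins, n_models)
    leader = int(np.argmax(beta))
    for k in range(1, max_drop + 1):
        runner_up = int(np.argsort(-beta)[1])
        # prefer dropping head-to-head wins of the leader over the runner-up,
        # falling back to any other win credited to the leader
        direct = np.where((wins[:, 0] == leader) & (wins[:, 1] == runner_up))[0]
        fallback = np.where(wins[:, 0] == leader)[0]
        candidates = direct if len(direct) else fallback
        if len(candidates) == 0:
            break
        wins = np.delete(wins, candidates[0], axis=0)
        beta = fit_bt(wins, n_models)
        new_leader = int(np.argmax(beta))
        if new_leader != leader:
            return k, new_leader
    return None, leader

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # toy data: models 0 and 1 nearly tied, model 2 clearly weaker
    true_scores = np.array([0.25, 0.20, -0.45])
    pairs = rng.integers(0, 3, size=(4000, 2))
    pairs = pairs[pairs[:, 0] != pairs[:, 1]]
    p_first_wins = 1.0 / (1.0 + np.exp(-(true_scores[pairs[:, 0]] - true_scores[pairs[:, 1]])))
    second_won = rng.random(len(pairs)) > p_first_wins
    wins = pairs.copy()
    wins[second_won] = wins[second_won][:, ::-1]
    k, new_leader = removals_to_flip(wins, n_models=3)
    if k is not None:
        print(f"top model flipped after dropping {k} of {len(wins)} preferences "
              f"({100 * k / len(wins):.2f}%); new leader: model {new_leader}")
    else:
        print("no flip within the removal budget")
```

On arena-scale data one would replace the brute-force refits with the paper's faster worst-case search, but the diagnostic quantity is the same: the smallest fraction of preferences whose removal changes the top-ranked model.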
Problem

Research questions and friction points this paper is trying to address.

Evaluating robustness of LLM ranking systems to data removal
Identifying influential preferences causing ranking flips
Comparing sensitivity between human and LLM evaluation methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Worst-case data-removal method for testing the robustness of Bradley–Terry LLM rankings
Identification of the specific preferences most responsible for ranking flips
Computationally fast, easy-to-adopt framework for sensitivity analysis