How Hard is it to Rig a Benchmark? A Social Choice Analysis of Leaderboard Robustness

📅 2026-05-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the vulnerability of multi-task benchmark leaderboards to benchmark-specific training, in which a developer includes benchmark data in training to raise a target model's rank. It applies computational social choice to benchmark robustness by modeling datasets as voters and models as candidates, which casts the choice of datasets to train on as a shift bribery problem. The authors prove that this problem is NP-hard under Borda count and mean win rate, and they define instance-level robustness: the minimum number of datasets a developer must train on to top a given leaderboard. Evaluated on MMLU under HELM and on BIG-Bench Hard (BBH) under the Open LLM Leaderboard, mean win rate is the hardest rule to manipulate. On BBH (24 tasks, 4507 models), its median robustness is 22 tasks (92%), compared with 13 (54%) under arithmetic mean and 12 (50%) under median and pairwise majority.

📝 Abstract

Multi-task benchmarks have become a central pillar of machine learning research, yet their growing influence has incentivised benchmark gaming -- strategic actions taken to improve the leaderboard rank of a specific model. Treating datasets as voters and models as candidates, we consider benchmark-specific training -- the inclusion of benchmark data in training -- as a form of election manipulation. For any ordinal benchmark, the problem of choosing datasets to train on so that a target model becomes top-ranked corresponds to shift bribery, a class of manipulation problems from computational social choice. Leveraging this identification, we show that the benchmark-specific training problem is NP-hard under Borda count and mean win rate. Complementing this worst-case perspective, we introduce the instance-level robustness, the minimum number of datasets a model developer must include in training to top a given leaderboard, and derive expressions for it under arithmetic mean, median, mean win rate and pairwise majority. We evaluate these expressions on MMLU under HELM and on BIG-Bench Hard (BBH) under the Open LLM Leaderboard. Across both suites, mean win rate is hardest to manipulate: this gap is clear on BBH (24 tasks, 4507 models), where its median robustness is 22 tasks (92%), compared with 13 (54%) under arithmetic mean and 12 (50%) under median and pairwise majority.
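To make instance-level robustness concrete, here is a minimal Python sketch on a toy leaderboard. It is not the authors' method: the paper derives closed-form expressions, whereas this sketch simply tries every subset of datasets. It also rests on a labeled assumption that training on a dataset lifts the target model's score on that dataset to the maximum (100 here). The names `mean_score`, `mean_win_rate`, `robustness`, and the score table are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def mean_score(scores, model):
    """Arithmetic mean of a model's per-dataset scores."""
    return sum(scores[model]) / len(scores[model])


def mean_win_rate(scores, model):
    """Fraction of (dataset, rival) pairs on which `model` scores strictly higher."""
    rivals = [other for other in scores if other != model]
    n_tasks = len(scores[model])
    wins = sum(
        scores[model][t] > scores[rival][t]
        for t in range(n_tasks)
        for rival in rivals
    )
    return wins / (n_tasks * len(rivals))


def robustness(scores, target, aggregate, boosted=100):
    """Smallest number of datasets the target must train on to be strictly top-ranked.

    ASSUMPTION (not from the source): training on a dataset sets the target's
    score on that dataset to `boosted`. Brute force, exponential in the number
    of datasets, so only suitable for toy instances.
    """
    n_tasks = len(scores[target])
    for k in range(n_tasks + 1):
        for subset in combinations(range(n_tasks), k):
            trial = dict(scores)
            trial[target] = [
                boosted if t in subset else s
                for t, s in enumerate(scores[target])
            ]
            mine = aggregate(trial, target)
            if all(mine > aggregate(trial, o) for o in trial if o != target):
                return k
    return None  # the target cannot top the leaderboard under this model


# Hypothetical toy leaderboard: 3 models, 4 datasets, scores in percent.
scores = {
    "A": [60, 60, 60, 60],
    "B": [10, 10, 10, 10],
    "target": [55, 55, 55, 55],
}

print(robustness(scores, "target", mean_score))     # 1
print(robustness(scores, "target", mean_win_rate))  # 3
```

In this toy instance, one boosted dataset is enough under arithmetic mean because a single large score moves the average past model A, while mean win rate needs three because each boosted dataset adds at most one win against A. This agrees in direction with the reported BBH result, but a hand-built example is only an illustration, not evidence.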

Problem

Research questions and friction points this paper is trying to address.

benchmark gaming

leaderboard robustness

shift bribery

computational social choice

multi-task benchmarks

Innovation

Methods, ideas, or system contributions that make the work stand out.

shift bribery

benchmark manipulation

instance-level robustness