Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates

📅 2024-10-09
🏛️ arXiv.org
📈 Citations: 5
Influential: 0
📄 PDF
🤖 AI Summary
This work exposes severe manipulability vulnerabilities in mainstream automated LLM benchmarks, including AlpacaEval 2.0, Arena-Hard-Auto, and MT-Bench: a "null model" that outputs a fixed, task-irrelevant response achieves scores far exceeding most open-source models (e.g., an 86.5% LC win rate on AlpacaEval 2.0). Methodologically, the authors craft constant cheating responses, analyze the resulting win rates, and validate the attack across benchmarks; they further show that these null responses transfer across tasks and benchmarks without any access to the private test instructions, still reaching state-of-the-art scores. They also investigate the feasibility of using LLMs to generate more covert cheating responses. The study reveals a reliability crisis in automated LLM evaluation and offers both a methodological warning and concrete guidance toward manipulation-resistant benchmarking systems.
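To make the attack concrete, here is a minimal Python sketch of the null-model idea under the paper's setup. The class name and the placeholder constant are illustrative; the paper's actual cheating response is a carefully crafted string optimized to sway the judge, which this sketch does not reproduce.

```python
# Minimal sketch of a "null model": a "model" whose output is a single
# fixed string, independent of the benchmark instruction it receives.
# CONSTANT_RESPONSE is a placeholder, not the paper's crafted response.

CONSTANT_RESPONSE = "<placeholder: crafted, instruction-irrelevant response>"

class NullModel:
    """Always returns the same pre-crafted response."""

    def __init__(self, constant_response: str = CONSTANT_RESPONSE):
        self.constant_response = constant_response

    def generate(self, instruction: str) -> str:
        # The input instruction is deliberately ignored.
        return self.constant_response

if __name__ == "__main__":
    model = NullModel()
    for prompt in ["Write a haiku.", "Explain TCP handshakes.", "Summarize this."]:
        print(model.generate(prompt))  # identical output every time
```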

📝 Abstract
Automatic LLM benchmarks, such as AlpacaEval 2.0, Arena-Hard-Auto, and MT-Bench, have become popular for evaluating language models due to their cost-effectiveness and scalability compared to human evaluation. Achieving high win rates on these benchmarks can significantly boost the promotional impact of newly released language models. This promotional benefit may motivate tricks, such as manipulating model output length or style to game win rates, even though several mechanisms have been developed to control length and disentangle style to reduce gameability. Nonetheless, we show that even a "null model" that always outputs a constant response (irrelevant to input instructions) can cheat automatic benchmarks and achieve top-ranked win rates: an 86.5% LC win rate on AlpacaEval 2.0; an 83.0 score on Arena-Hard-Auto; and a 9.55 score on MT-Bench. Moreover, the crafted cheating outputs are transferable because we assume that the instructions of these benchmarks (e.g., 805 samples of AlpacaEval 2.0) are private and cannot be accessed. While our experiments are primarily proof-of-concept, an adversary could use LLMs to generate more imperceptible cheating responses, unethically benefiting from high win rates and promotional impact. Our findings call for the development of anti-cheating mechanisms for reliable automatic benchmarks. The code is available at https://github.com/sail-sg/Cheating-LLM-Benchmarks.
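For context on what these scores measure, the sketch below shows how a pairwise win rate is typically computed on judge-based benchmarks such as AlpacaEval 2.0: an auto-annotator compares the candidate's answer against a reference model's answer on each instruction. The `judge_prefers_candidate` callable is a stand-in for the real GPT-4-based annotator, and AlpacaEval 2.0's length-controlled (LC) correction is omitted here.

```python
from typing import Callable, Sequence

def pairwise_win_rate(
    instructions: Sequence[str],
    candidate: Callable[[str], str],   # model under evaluation
    baseline: Callable[[str], str],    # reference model's answers
    judge_prefers_candidate: Callable[[str, str, str], bool],  # stand-in judge
) -> float:
    """Fraction of instructions on which the judge prefers the candidate."""
    wins = sum(
        judge_prefers_candidate(inst, candidate(inst), baseline(inst))
        for inst in instructions
    )
    return wins / len(instructions)

# A null model slots in as `candidate` without modification:
#   pairwise_win_rate(benchmark, lambda _: CONSTANT_RESPONSE, baseline, judge)
# The paper's point: a well-crafted constant response drives this number to
# top-ranked levels even though it ignores every instruction.
```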
Problem

Research questions and friction points this paper is trying to address.

A null model that always outputs the same constant, instruction-irrelevant response can cheat automatic LLM benchmarks.
Such crafted cheating outputs achieve top-ranked win rates, outscoring most open-source models.
Reliable automatic benchmarks therefore need anti-cheating mechanisms beyond length and style controls.
Innovation

Methods, ideas, or system contributions that make the work stand out.

A null model with a single constant response achieves top-ranked scores: 86.5% LC win rate on AlpacaEval 2.0, 83.0 on Arena-Hard-Auto, and 9.55 on MT-Bench.
The cheating outputs are crafted without access to the benchmarks' private instructions and transfer across tasks and benchmarks.
The results motivate concrete anti-cheating mechanisms for judge-based automatic benchmarks.
👥 Authors
Xiaosen Zheng, Researcher @ TikTok
Tianyu Pang, Sea AI Lab, Singapore
Chao Du, Sea AI Lab, Singapore
Qian Liu, Sea AI Lab, Singapore
Jing Jiang, Singapore Management University
Min Lin, Principal Research Scientist, Sea AI Lab