Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates

📅 2024-10-09
🏛️ arXiv.org
📈 Citations: 5
Influential: 0
📄 PDF
🤖 AI Summary
This work exposes severe manipulability vulnerabilities in mainstream automated LLM benchmarks, including AlpacaEval 2.0, Arena-Hard-Auto, and MT-Bench: a "null model" that outputs a fixed, task-irrelevant response achieves scores far exceeding most open-source models (e.g., an 86.5% LC win rate on AlpacaEval 2.0). Methodologically, the authors craft constant cheating responses, analyze the resulting win rates, and validate the attack across benchmarks; they further show that these null responses transfer across tasks and benchmarks without any access to the private test instructions, still reaching state-of-the-art scores. They also investigate the feasibility of using LLMs to generate more covert cheating responses. The study reveals a reliability crisis in automated LLM evaluation and offers both a methodological warning and concrete guidance toward manipulation-resistant benchmarking systems.
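To make the attack concrete, here is a minimal Python sketch of the null-model idea under the paper's setup. The class name and the placeholder constant are illustrative; the paper's actual cheating response is a carefully crafted string optimized to sway the judge, which this sketch does not reproduce.

```python
# Minimal sketch of a "null model": a "model" whose output is a single
# fixed string, independent of the benchmark instruction it receives.
# CONSTANT_RESPONSE is a placeholder, not the paper's crafted response.

CONSTANT_RESPONSE = "<placeholder: crafted, instruction-irrelevant response>"

class NullModel:
    """Always returns the same pre-crafted response."""

    def __init__(self, constant_response: str = CONSTANT_RESPONSE):
        self.constant_response = constant_response

    def generate(self, instruction: str) -> str:
        # The input instruction is deliberately ignored.
        return self.constant_response

if __name__ == "__main__":
    model = NullModel()
    for prompt in ["Write a haiku.", "Explain TCP handshakes.", "Summarize this."]:
        print(model.generate(prompt))  # identical output every time
```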

📝 Abstract
Automatic LLM benchmarks, such as AlpacaEval 2.0, Arena-Hard-Auto, and MT-Bench, have become popular for evaluating language models due to their cost-effectiveness and scalability compared to human evaluation. Achieving high win rates on these benchmarks can significantly boost the promotional impact of newly released language models. This promotional benefit may motivate tricks, such as manipulating model output length or style to game win rates, even though several mechanisms have been developed to control length and disentangle style to reduce gameability. Nonetheless, we show that even a "null model" that always outputs a constant response (irrelevant to input instructions) can cheat automatic benchmarks and achieve top-ranked win rates: an 86.5% LC win rate on AlpacaEval 2.0; an 83.0 score on Arena-Hard-Auto; and a 9.55 score on MT-Bench. Moreover, the crafted cheating outputs are transferable because we assume that the instructions of these benchmarks (e.g., 805 samples of AlpacaEval 2.0) are private and cannot be accessed. While our experiments are primarily proof-of-concept, an adversary could use LLMs to generate more imperceptible cheating responses, unethically benefiting from high win rates and promotional impact. Our findings call for the development of anti-cheating mechanisms for reliable automatic benchmarks. The code is available at https://github.com/sail-sg/Cheating-LLM-Benchmarks.
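For context on what these scores measure, the sketch below shows how a pairwise win rate is typically computed on judge-based benchmarks such as AlpacaEval 2.0: an auto-annotator compares the candidate's answer against a reference model's answer on each instruction. The `judge_prefers_candidate` callable is a stand-in for the real GPT-4-based annotator, and AlpacaEval 2.0's length-controlled (LC) correction is omitted here.

```python
from typing import Callable, Sequence

def pairwise_win_rate(
    instructions: Sequence[str],
    candidate: Callable[[str], str],   # model under evaluation
    baseline: Callable[[str], str],    # reference model's answers
    judge_prefers_candidate: Callable[[str, str, str], bool],  # stand-in judge
) -> float:
    """Fraction of instructions on which the judge prefers the candidate."""
    wins = sum(
        judge_prefers_candidate(inst, candidate(inst), baseline(inst))
        for inst in instructions
    )
    return wins / len(instructions)

# A null model slots in as `candidate` without modification:
#   pairwise_win_rate(benchmark, lambda _: CONSTANT_RESPONSE, baseline, judge)
# The paper's point: a well-crafted constant response drives this number to
# top-ranked levels even though it ignores every instruction.
```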
Problem

Research questions and friction points this paper is trying to address.

A null model that always outputs the same constant, instruction-irrelevant response can cheat automatic LLM benchmarks.
Such crafted cheating outputs achieve top-ranked win rates, outscoring most open-source models.
Reliable automatic benchmarks therefore need anti-cheating mechanisms beyond length and style controls.
Innovation

Methods, ideas, or system contributions that make the work stand out.

A null model with a single constant response achieves top-ranked scores: 86.5% LC win rate on AlpacaEval 2.0, 83.0 on Arena-Hard-Auto, and 9.55 on MT-Bench.
The cheating outputs are crafted without access to the benchmarks' private instructions and transfer across tasks and benchmarks.
The results motivate concrete anti-cheating mechanisms for judge-based automatic benchmarks.
👥 Authors
Xiaosen Zheng, Researcher @ TikTok
Tianyu Pang, Sea AI Lab, Singapore
Chao Du, Sea AI Lab, Singapore
Qian Liu, Sea AI Lab, Singapore
Jing Jiang, Singapore Management University
Min Lin, Principal Research Scientist, Sea AI Lab