A Survey on Large Language Model Benchmarks

📅 2025-08-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current LLM benchmarks suffer from pervasive data contamination, cultural-linguistic bias, lack of procedural transparency, and insufficient dynamism, leading to unreliable evaluations. To address this, we conduct the first systematic survey of 283 mainstream LLM benchmarks, proposing a three-dimensional taxonomy—spanning general capabilities, domain-specific competencies, and goal-specific functionalities—that encompasses language understanding, knowledge reasoning, natural sciences, social sciences and humanities, and risk controllability. Through empirical analysis, we uncover structural biases in evaluation objectives, data provenance, and assessment methodologies. Our key contributions are: (1) the first comprehensive benchmark taxonomy map; (2) identification of critical assessment deficiencies; and (3) a novel benchmark design paradigm grounded in trustworthiness, fairness, and adaptability. This work establishes both a theoretical framework and practical guidelines for developing high-fidelity, next-generation LLM evaluation systems.

📝 Abstract
In recent years, as the depth and breadth of large language models' capabilities have developed rapidly, a growing number of corresponding evaluation benchmarks have emerged. As quantitative assessment tools for model performance, benchmarks are not only a core means of measuring model capabilities but also a key element in guiding the direction of model development and promoting technological innovation. We present the first systematic review of the current status and development of large language model benchmarks, categorizing 283 representative benchmarks into three categories: general capabilities, domain-specific, and target-specific. General capability benchmarks cover aspects such as core linguistics, knowledge, and reasoning; domain-specific benchmarks focus on fields such as natural sciences, humanities and social sciences, and engineering technology; target-specific benchmarks address risks, reliability, agents, and related concerns. We point out that current benchmarks suffer from problems such as inflated scores caused by data contamination, unfair evaluation due to cultural and linguistic biases, and a lack of evaluation of process credibility and performance in dynamic environments, and we provide a design paradigm that future benchmark innovation can draw on.
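For a quick sense of the structure, here is a minimal sketch of the three-level categorization described in the abstract, written as a plain Python mapping. The category and subcategory names come from the abstract; the dictionary layout and the helper function are illustrative assumptions, not the paper's actual schema.

```python
# Illustrative sketch of the survey's benchmark taxonomy.
# Category and subcategory names are taken from the abstract; the dictionary
# layout and the helper below are assumptions for illustration only.
BENCHMARK_TAXONOMY = {
    "general capabilities": ["core linguistics", "knowledge", "reasoning"],
    "domain-specific": [
        "natural sciences",
        "humanities and social sciences",
        "engineering technology",
    ],
    "target-specific": ["risks", "reliability", "agents"],
}


def subcategories(category: str) -> list[str]:
    """Return the subcategories listed under a top-level category (empty if unknown)."""
    return BENCHMARK_TAXONOMY.get(category, [])


if __name__ == "__main__":
    for name, subs in BENCHMARK_TAXONOMY.items():
        print(f"{name}: {', '.join(subs)}")
```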
Problem

Research questions and friction points this paper is trying to address.

Inflated benchmark scores caused by data contamination
Unfair evaluation resulting from cultural and linguistic biases
Lack of assessment of process credibility and of performance in dynamic environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic categorization of 283 benchmarks
Identification of data contamination and bias issues
Proposed design paradigm for future benchmarks
Authors
Shiwen Ni
Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
Guhong Chen
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
LLM, NLP
Shuaimin Li
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
Natural Language Processing, Tabular Data Visualization
Xuanang Chen
Institute of Software, Chinese Academy of Sciences
Information Retrieval, Natural Language Processing
Siyi Li
Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
Siyi Li
University of Science and Technology of China
Bingli Wang
Shanghai AI Lab
Qiyao Wang
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
Natural Language Processing, Large Language Models, Agentic AI, Patent Processing, AI for IP
Xingjian Wang
Shanghai University of Electric Power
Yifan Zhang
South China University of Technology
Liyang Fan
Shenzhen University
Chengming Li
Shenzhen MSU-BIT University
Ruifeng Xu
Professor, Harbin Institute of Technology at Shenzhen
Natural Language Processing, Affective Computing, Argumentation Mining, LLMs, Bioinformatics
Le Sun
Institute of Software, CAS
Information Retrieval, Natural Language Processing
Min Yang
Bytedance
Vision Language Model, Computer Vision, Video Understanding