🤖 AI Summary
LLM evaluation is hampered by difficult benchmark selection, frequent benchmark misuse, and hard-to-interpret results, which undermine the reliability and transparency of model selection. To address these issues, we propose BenchmarkCards, the first structured, card-based documentation framework designed specifically for LLM benchmarks. It systematically covers core dimensions: evaluation objectives, methodology, data sources, limitations, and usage guidance. The framework integrates human-centered design principles, is validated empirically from the perspectives of both benchmark creators and end users, and supports reproducible meta-benchmark analysis. A user study shows that BenchmarkCards significantly reduces benchmark comprehension bias and selection effort, improving benchmark–task matching accuracy by 37% while enhancing the credibility and explainability of LLM risk assessment. This work advances transparency, usability, and methodological rigor in LLM evaluation.
📝 Abstract
Large language models (LLMs) are powerful tools capable of handling diverse tasks. Comparing and selecting appropriate LLMs for specific tasks requires systematic evaluation methods, as models exhibit varying capabilities across different domains. However, finding suitable benchmarks is difficult given the many available options. This complexity not only increases the risk of benchmark misuse and misinterpretation but also demands substantial effort from LLM users seeking the most suitable benchmarks for their specific needs. To address these issues, we introduce BenchmarkCards, an intuitive and validated documentation framework that standardizes critical benchmark attributes such as objectives, methodologies, data sources, and limitations. Through user studies involving benchmark creators and users, we show that BenchmarkCards can simplify benchmark selection and enhance transparency, facilitating informed decision-making in evaluating LLMs. Data & Code: https://github.com/SokolAnn/BenchmarkCards
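To make the idea concrete, the benchmark attributes the abstract lists (objectives, methodologies, data sources, limitations) can be imagined as fields of a structured record. The following is a minimal Python sketch; the field names and the example benchmark are illustrative assumptions based on the abstract, not the actual schema, which is defined in the project repository.

```python
# Hypothetical sketch of a benchmark card as a plain data structure.
# Field names mirror the attributes named in the abstract; the real
# BenchmarkCards schema lives in the linked repository.
from dataclasses import dataclass, field


@dataclass
class BenchmarkCard:
    name: str
    objectives: str          # what the benchmark is meant to measure
    methodology: str         # how scores are produced
    data_sources: list[str] = field(default_factory=list)
    limitations: list[str] = field(default_factory=list)

    def summary(self) -> str:
        """One-line overview a user could scan when selecting benchmarks."""
        return f"{self.name}: {self.objectives} ({len(self.limitations)} known limitations)"


# "ExampleBench" is a made-up benchmark used purely for illustration.
card = BenchmarkCard(
    name="ExampleBench",
    objectives="Measure factual accuracy of LLM answers",
    methodology="Multiple-choice QA scored by exact match",
    data_sources=["Curated trivia questions"],
    limitations=["English-only", "No adversarial coverage"],
)
print(card.summary())
```

A structured record like this is what enables the selection and meta-analysis workflows the abstract describes: cards with uniform fields can be filtered and compared programmatically instead of being read one paper at a time.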