🤖 AI Summary
Existing large language model (LLM) benchmarks for code lack systematic construction guidelines, resulting in poor data quality, incomplete open-sourcing, duplicated samples, and leaked sensitive information, all of which severely undermine evaluation validity and reproducibility.
Method: We propose How2Bench, the first comprehensive, fine-grained, and actionable guideline for code benchmark development across the full lifecycle, comprising 55 evaluation criteria. It is grounded in a benchmark quality analysis framework, a metadata census of 274 code benchmarks published over the past decade, and an empirical human study involving 49 practitioners.
Contribution/Results: Our analysis reveals that ~70% of existing benchmarks lack adequate data quality assurance and >10% are not fully open-sourced. How2Bench significantly enhances defect detection capability and promotes a community-wide paradigm shift toward high-quality, transparent, and reproducible benchmark construction.
📝 Abstract
Various benchmarks have been proposed to assess the performance of large language models (LLMs) in different coding scenarios; we refer to them as code-related benchmarks. However, no systematic guidelines exist for developing such benchmarks in a way that ensures their quality, reliability, and reproducibility. We propose How2Bench, a 55-criteria checklist that serves as a comprehensive set of guidelines governing the development of code-related benchmarks. Using How2Bench, we profiled 274 benchmarks released within the past decade and found concerning issues. Nearly 70% of the benchmarks took no measures for data quality assurance, and over 10% were not open-sourced or were only partially open-sourced. Many highly cited benchmarks contain loopholes, including duplicated samples; incorrect reference code, tests, or prompts; and unremoved sensitive or confidential information. Finally, we conducted a human study involving 49 participants, which revealed significant gaps in awareness of the importance of data quality, reproducibility, and transparency.
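Two of the defect classes above, duplicated samples and leaked secrets, can be screened for mechanically during benchmark construction. A minimal sketch in Python, using hypothetical sample data and a deliberately rough secret pattern (real scanners use much larger rule sets; the field names and samples here are illustrative assumptions, not part of How2Bench):

```python
import hashlib
import re

# Hypothetical benchmark samples; in practice these would be loaded
# from the benchmark's released data files.
samples = [
    {"id": 1, "prompt": "Add two numbers", "reference": "def add(a, b):\n    return a + b"},
    {"id": 2, "prompt": "Add two numbers", "reference": "def add(a, b):\n    return a + b"},
    {"id": 3, "prompt": "Call an API", "reference": "API_KEY = 'sk-123456'"},
]

def find_duplicates(samples):
    """Flag sample pairs whose (prompt, reference) content hashes collide."""
    seen, dups = {}, []
    for s in samples:
        digest = hashlib.sha256((s["prompt"] + "\0" + s["reference"]).encode()).hexdigest()
        if digest in seen:
            dups.append((seen[digest], s["id"]))
        else:
            seen[digest] = s["id"]
    return dups

# Very rough secret-like pattern; illustrative only.
SECRET_RE = re.compile(r"(api[_-]?key|secret|password|token)\s*=", re.IGNORECASE)

def find_leaks(samples):
    """Flag samples whose reference code matches a secret-like pattern."""
    return [s["id"] for s in samples if SECRET_RE.search(s["reference"])]

print(find_duplicates(samples))  # [(1, 2)]
print(find_leaks(samples))       # [3]
```

Checks of this kind are cheap to run over an entire dataset before release, which is precisely the sort of lifecycle-stage quality assurance the checklist advocates.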