How Should I Build A Benchmark?

📅 2025-01-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing large language model (LLM) benchmarks for code lack systematic construction guidelines, resulting in poor data quality, incomplete open-sourcing, sample duplication, and leakage of sensitive information, all of which undermine evaluation validity and reproducibility. Method: We propose How2Bench, a comprehensive, fine-grained, and actionable guideline of 55 criteria governing code benchmark development across the full lifecycle. It is grounded in a benchmark quality analysis framework, a metadata census of 274 code benchmarks published over the past decade, and an empirical human study involving 49 practitioners. Contribution/Results: The analysis reveals that nearly 70% of existing benchmarks lack adequate data quality assurance and over 10% are not fully open-sourced. How2Bench improves defect detection and promotes a community-wide shift toward high-quality, transparent, and reproducible benchmark construction.

📝 Abstract
Various benchmarks have been proposed to assess the performance of large language models (LLMs) in different coding scenarios; we refer to them as code-related benchmarks. However, there are no systematic guidelines for developing such a benchmark to ensure its quality, reliability, and reproducibility. We propose How2Bench, a 55-criteria checklist that serves as a comprehensive set of guidelines governing the development of code-related benchmarks. Using How2Bench, we profiled 274 benchmarks released within the past decade and found concerning issues. Nearly 70% of the benchmarks took no measures for data quality assurance, and over 10% were not open-sourced or were only partially open-sourced. Many highly cited benchmarks have loopholes, including duplicated samples; incorrect reference code, tests, or prompts; and unremoved sensitive or confidential information. Finally, we conducted a human study involving 49 participants, which revealed significant gaps in awareness of the importance of data quality, reproducibility, and transparency.
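One of the flaws the abstract highlights is duplicated samples in widely used benchmarks. As a minimal sketch of the kind of data-quality check the paper advocates (the `samples` structure, the `prompt` field, and the whitespace-normalizing comparison are illustrative assumptions, not the paper's actual tooling), near-identical prompts can be caught by hashing a normalized form of each sample:

```python
import hashlib

def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so near-identical prompts compare equal."""
    return " ".join(text.lower().split())

def find_duplicates(samples):
    """Return (duplicate_index, first_seen_index) pairs over normalized prompts."""
    seen = {}
    dupes = []
    for i, sample in enumerate(samples):
        key = hashlib.sha256(normalize(sample["prompt"]).encode()).hexdigest()
        if key in seen:
            dupes.append((i, seen[key]))
        else:
            seen[key] = i
    return dupes

# Hypothetical benchmark slice: the second prompt differs only in whitespace.
samples = [
    {"prompt": "Write a function that reverses a string."},
    {"prompt": "Write a function that  reverses a string."},
    {"prompt": "Write a function that sorts a list."},
]
print(find_duplicates(samples))  # [(1, 0)]
```

A hash over a normalized string only catches trivial duplicates; semantic near-duplicates would need fuzzier matching (e.g. token-level similarity), which is precisely why checklist-driven quality assurance, rather than ad hoc spot checks, matters.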
Problem

Research questions and friction points this paper is trying to address.

Large Model Evaluation
Programming Test Quality
Transparency and Reproducibility
Innovation

Methods, ideas, or system contributions that make the work stand out.

How2Bench
Programming Test Evaluation
Quality Assurance