Can LLMs Generate High-Quality Test Cases for Algorithm Problems? TestCase-Eval: A Systematic Evaluation of Fault Coverage and Exposure

📅 2025-06-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Evaluating large language models' (LLMs') ability to generate test cases for algorithmic problems remains challenging because rigorous, multidimensional assessment criteria have been lacking. Method: We propose the first quantitative evaluation framework that jointly measures *fault coverage* and *fault exposure*, built on TestCase-Eval, a new large-scale, human-annotated benchmark comprising 500 real Codeforces algorithm problems and 100,000 human-crafted solutions. The methodology combines program analysis with differential testing so that generated test cases can be validated automatically, without a hand-written oracle. We systematically evaluate 19 mainstream open-source and closed-source LLMs. Contribution/Results: The study uncovers significant performance gaps across all evaluated models, particularly in boundary-case coverage and error localization, revealing shared bottlenecks in algorithmic test generation. TestCase-Eval establishes a foundational benchmark for rigorously assessing LLMs' test-generation competence on algorithmic reasoning tasks.

📝 Abstract
We introduce TestCase-Eval, a new benchmark for systematic evaluation of LLMs in test-case generation. TestCase-Eval includes 500 algorithm problems and 100,000 human-crafted solutions from the Codeforces platform. It focuses on two pivotal tasks: (1) Fault Coverage, which measures how well LLM-generated test sets probe diverse input scenarios and cover a wide range of potential failure modes. (2) Fault Exposure, which evaluates whether LLMs can craft a tailored test input that reveals a specific incorrect code implementation. We provide a comprehensive assessment of 19 state-of-the-art open-source and proprietary LLMs on TestCase-Eval, offering insights into their strengths and limitations in generating effective test cases for algorithm problems.
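To make the Fault Exposure task concrete, the sketch below shows one way such a check could be implemented via differential testing: the generated input is fed to a trusted reference solution and to a known-buggy submission, and the test counts as exposing the fault when the buggy program crashes, times out, or prints a different answer. This is a minimal illustrative sketch under those assumptions, not the paper's implementation; the names `exposes_fault`, `reference_cmd`, and `buggy_cmd` are hypothetical.

```python
import subprocess

def exposes_fault(test_input: str, reference_cmd: list[str], buggy_cmd: list[str],
                  timeout_s: float = 2.0) -> bool:
    """Differential check: does `test_input` make the buggy program diverge
    from the reference program (wrong answer, crash, or timeout)?"""

    def run(cmd: list[str]):
        # Returns (exit_code, stdout); (None, None) signals a timeout.
        try:
            proc = subprocess.run(cmd, input=test_input, capture_output=True,
                                  text=True, timeout=timeout_s)
            return proc.returncode, proc.stdout.strip()
        except subprocess.TimeoutExpired:
            return None, None

    ref_code, ref_out = run(reference_cmd)
    if ref_code != 0 or ref_out is None:
        raise RuntimeError("the reference solution must succeed on the test input")

    bug_code, bug_out = run(buggy_cmd)
    # The fault is exposed if the buggy program crashes, hangs, or disagrees.
    return bug_code != 0 or bug_out is None or bug_out != ref_out
```

A single generated input would then be scored against one target buggy submission, e.g. `exposes_fault("3\n1 2 3\n", ["./reference"], ["./buggy"])`. Comparing outputs by exact string match is a simplification; problems with multiple valid answers would need a special checker instead.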
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' ability to generate high-quality algorithm test cases
Assessing fault coverage in diverse input scenarios
Measuring fault exposure for specific incorrect code implementations
Innovation

Methods, ideas, or system contributions that make the work stand out.

TestCase-Eval benchmark for LLM test-case evaluation
Measures Fault Coverage in diverse input scenarios (see the sketch after this list)
Evaluates Fault Exposure with tailored test inputs
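Under the same assumptions, Fault Coverage can be summarized as the fraction of buggy solutions in the pool that are exposed by at least one test in the LLM-generated test set. The minimal sketch below aggregates a precomputed exposure matrix; the `exposure_matrix` name is hypothetical and the paper's exact scoring may differ.

```python
from typing import Sequence

def fault_coverage(exposure_matrix: Sequence[Sequence[bool]]) -> float:
    """exposure_matrix[i][j] is True when generated test j exposes buggy solution i
    (e.g. as decided by a differential check like the one sketched earlier).
    Returns the fraction of buggy solutions caught by at least one test."""
    if not exposure_matrix:
        return 0.0
    exposed = sum(any(row) for row in exposure_matrix)
    return exposed / len(exposure_matrix)
```

For example, `fault_coverage([[False, True], [False, False]])` returns 0.5: the first buggy solution is caught by the second test, while the second buggy solution escapes the whole test set.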
Authors
Zheyuan Yang (Tongji University)
Zexi Kuang (Northeastern University)
Xue Xia (Pinterest)
Yilun Zhao (Yale University)