🤖 AI Summary
Evaluating the capability of large language models (LLMs) to generate test cases for algorithmic problems remains challenging due to the lack of rigorous, multidimensional assessment criteria.
Method: We propose the first quantitative evaluation framework that jointly measures *fault coverage* and *fault exposure*, built on TestCase-Eval, a large-scale, human-annotated benchmark comprising 500 real Codeforces algorithmic problems and 100,000 manually verified solutions. The methodology integrates program analysis with differential testing for automated, oracle-free validation of generated test cases. We conduct a systematic evaluation across 19 mainstream open- and closed-source LLMs.
Contribution/Results: The study uncovers significant performance gaps across all evaluated models, particularly in boundary-case coverage and error localization, revealing shared bottlenecks in algorithmic test generation. TestCase-Eval establishes a foundational benchmark for rigorously assessing LLMs' test-generation competence in algorithmic reasoning tasks.
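To make the *fault coverage* notion concrete, here is a minimal sketch of one plausible scoring rule: a test set's coverage is the fraction of known-buggy solutions that at least one generated test "kills" (i.e., makes disagree with a correct reference). The function names, the toy task, and the exact scoring formula are illustrative assumptions, not the paper's protocol.

```python
def fault_coverage(test_inputs, reference, buggy_solutions):
    """Fraction of buggy solutions exposed by at least one test input.

    A buggy solution counts as 'killed' when some test input makes its
    output diverge from the reference solution's output (differential
    testing, so no hand-written oracle is required).
    """
    killed = sum(
        1
        for buggy in buggy_solutions
        if any(reference(t) != buggy(t) for t in test_inputs)
    )
    return killed / len(buggy_solutions)


# Toy example: the intended behavior is absolute value.
reference = abs
buggy_identity = lambda x: x    # wrong for negative inputs
buggy_negate = lambda x: -x     # wrong for positive inputs

tests_weak = [0, 5]             # never probes negative inputs
tests_strong = [0, 5, -5]       # adds a boundary-crossing input

print(fault_coverage(tests_weak, reference, [buggy_identity, buggy_negate]))    # 0.5
print(fault_coverage(tests_strong, reference, [buggy_identity, buggy_negate]))  # 1.0
```

The toy run illustrates the performance gap the study reports: a test set that skips boundary regions (here, negative inputs) can leave entire classes of bugs unkilled.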
📝 Abstract
We introduce TestCase-Eval, a new benchmark for systematic evaluation of LLMs in test-case generation. TestCase-Eval includes 500 algorithm problems and 100,000 human-crafted solutions from the Codeforces platform. It focuses on two pivotal tasks: (1) Fault Coverage, which measures how well LLM-generated test sets probe diverse input scenarios and cover a wide range of potential failure modes. (2) Fault Exposure, which evaluates whether LLMs can craft a tailored test input that reveals a specific incorrect code implementation. We provide a comprehensive assessment of 19 state-of-the-art open-source and proprietary LLMs on TestCase-Eval, offering insights into their strengths and limitations in generating effective test cases for algorithm problems.
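The Fault Exposure task above can be sketched as a differential check: a generated test input "exposes" a specific buggy implementation when that implementation's output diverges from a known-correct reference on the input. The example below uses a hypothetical maximum-subarray task with a planted boundary-case bug; all names and the comparison rule are illustrative assumptions, not the benchmark's actual harness.

```python
def reference_max_subarray(nums):
    """Correct Kadane's algorithm for the maximum subarray sum."""
    best = cur = nums[0]
    for x in nums[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best


def buggy_max_subarray(nums):
    """Faulty variant: resets the running sum at zero, so it returns 0
    (an empty subarray) on all-negative inputs instead of the largest
    element."""
    best = cur = 0
    for x in nums:
        cur = max(0, cur + x)
        best = max(best, cur)
    return best


def exposes_fault(test_input, reference, candidate):
    """True if `test_input` makes `candidate` disagree with `reference`."""
    return reference(test_input) != candidate(test_input)


# A typical "happy path" input fails to expose the bug...
print(exposes_fault([1, -2, 3, 4], reference_max_subarray, buggy_max_subarray))  # False
# ...while an all-negative boundary case reveals it.
print(exposes_fault([-3, -1, -2], reference_max_subarray, buggy_max_subarray))   # True
```

Under this framing, an LLM succeeds at Fault Exposure only if it can reason about *where* a given incorrect implementation deviates and then construct an input in that region, which is why the task probes error localization rather than generic input generation.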