🤖 AI Summary
Evaluating the capability of large language models (LLMs) to generate test cases for algorithmic problems remains challenging due to the lack of rigorous, multidimensional assessment criteria.
Method: We propose the first quantitative evaluation framework that jointly measures *fault coverage* and *fault exposure*, built on TestCase-Eval, a large-scale, human-annotated benchmark comprising 500 real Codeforces algorithmic problems and 100,000 manually verified solutions. The methodology integrates program analysis with differential testing for automated, oracle-free validation of generated test cases. We conduct a systematic evaluation across 19 mainstream open- and closed-source LLMs.
Contribution/Results: The study uncovers significant performance gaps across all evaluated models, particularly in boundary-case coverage and error localization, revealing shared bottlenecks in algorithmic test generation. TestCase-Eval establishes a foundational benchmark for rigorously assessing LLMs' test-generation competence in algorithmic reasoning tasks.
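To make the *fault coverage* notion concrete, here is a minimal sketch of one plausible scoring rule: a test set's coverage is the fraction of known-buggy solutions that at least one generated test "kills" (i.e., makes disagree with a correct reference). The function names, the toy task, and the exact scoring formula are illustrative assumptions, not the paper's protocol.

```python
def fault_coverage(test_inputs, reference, buggy_solutions):
    """Fraction of buggy solutions exposed by at least one test input.

    A buggy solution counts as 'killed' when some test input makes its
    output diverge from the reference solution's output (differential
    testing, so no hand-written oracle is required).
    """
    killed = sum(
        1
        for buggy in buggy_solutions
        if any(reference(t) != buggy(t) for t in test_inputs)
    )
    return killed / len(buggy_solutions)


# Toy example: the intended behavior is absolute value.
reference = abs
buggy_identity = lambda x: x    # wrong for negative inputs
buggy_negate = lambda x: -x     # wrong for positive inputs

tests_weak = [0, 5]             # never probes negative inputs
tests_strong = [0, 5, -5]       # adds a boundary-crossing input

print(fault_coverage(tests_weak, reference, [buggy_identity, buggy_negate]))    # 0.5
print(fault_coverage(tests_strong, reference, [buggy_identity, buggy_negate]))  # 1.0
```

The toy run illustrates the performance gap the study reports: a test set that skips boundary regions (here, negative inputs) can leave entire classes of bugs unkilled.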
📝 Abstract
We introduce TestCase-Eval, a new benchmark for systematic evaluation of LLMs in test-case generation. TestCase-Eval includes 500 algorithm problems and 100,000 human-crafted solutions from the Codeforces platform. It focuses on two pivotal tasks: (1) Fault Coverage, which measures how well LLM-generated test sets probe diverse input scenarios and cover a wide range of potential failure modes. (2) Fault Exposure, which evaluates whether LLMs can craft a tailored test input that reveals a specific incorrect code implementation. We provide a comprehensive assessment of 19 state-of-the-art open-source and proprietary LLMs on TestCase-Eval, offering insights into their strengths and limitations in generating effective test cases for algorithm problems.
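The Fault Exposure task above can be sketched as a differential check: a generated test input "exposes" a specific buggy implementation when that implementation's output diverges from a known-correct reference on the input. The example below uses a hypothetical maximum-subarray task with a planted boundary-case bug; all names and the comparison rule are illustrative assumptions, not the benchmark's actual harness.

```python
def reference_max_subarray(nums):
    """Correct Kadane's algorithm for the maximum subarray sum."""
    best = cur = nums[0]
    for x in nums[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best


def buggy_max_subarray(nums):
    """Faulty variant: resets the running sum at zero, so it returns 0
    (an empty subarray) on all-negative inputs instead of the largest
    element."""
    best = cur = 0
    for x in nums:
        cur = max(0, cur + x)
        best = max(best, cur)
    return best


def exposes_fault(test_input, reference, candidate):
    """True if `test_input` makes `candidate` disagree with `reference`."""
    return reference(test_input) != candidate(test_input)


# A typical "happy path" input fails to expose the bug...
print(exposes_fault([1, -2, 3, 4], reference_max_subarray, buggy_max_subarray))  # False
# ...while an all-negative boundary case reveals it.
print(exposes_fault([-3, -1, -2], reference_max_subarray, buggy_max_subarray))   # True
```

Under this framing, an LLM succeeds at Fault Exposure only if it can reason about *where* a given incorrect implementation deviates and then construct an input in that region, which is why the task probes error localization rather than generic input generation.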