How Many Code and Test Cases Are Enough? Evaluating Test Cases Generation from a Binary-Matrix Perspective

๐Ÿ“… 2025-10-09
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Evaluating LLM-generated test cases remains challenging due to the lack of a rigorous, diagnostic benchmark that ensures complete fault coverage with minimal test cases. Method: We propose TC-Benchโ€”a minimal, complete, highly diagnostic, and score-inflation-resistant benchmark. We formulate benchmark construction as the optimal diagnostic basis selection problem over a binary codeโ€“test matrix; theoretically prove that the minimum test suite size equals the matrix rank; and design WrongSelect, an efficient algorithm maximizing diversity among failure-inducing inputs. Contribution/Results: Validated on large-scale competitive programming data, TC-Bench exposes critical limitations in current evaluation practices: state-of-the-art methods achieve only ~60% fault exclusion rate on TC-Bench, underscoring the insufficiency of existing benchmarks. TC-Bench thus establishes a new standard for rigorous, diagnosis-oriented evaluation of test-generation capabilities.
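The rank argument above can be illustrated with a small sketch. This is a minimal toy example, assuming (as a simplification not stated in the summary) that rank is taken over GF(2); the matrix layout (rows as wrong codes, columns as test cases, 1 meaning the test exposes the bug) is likewise an illustrative assumption:

```python
def gf2_rank(matrix):
    """Rank of a binary matrix over GF(2) via Gaussian elimination.

    Rows are wrong codes, columns are test cases; entry 1 means the
    test case fails (exposes) that wrong code.
    """
    # Pack each row into an integer bitmask for fast XOR elimination.
    rows = [int("".join(map(str, row)), 2) for row in matrix]
    rank = 0
    while rows:
        pivot = max(rows)
        rows.remove(pivot)
        if pivot == 0:
            continue
        rank += 1
        high_bit = pivot.bit_length() - 1
        # Eliminate the pivot's leading bit from all remaining rows.
        rows = [r ^ pivot if (r >> high_bit) & 1 else r for r in rows]
    return rank

# Toy code-test matrix: 4 wrong codes x 3 tests.
M = [
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 0],  # XOR of the first two rows: not an independent error pattern
    [0, 0, 1],
]
print(gf2_rank(M))  # 3 independent error patterns
```

Under this reading, the third wrong code adds no diagnostic information, so three tests suffice for the four bugs.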

๐Ÿ“ Abstract
Evaluating test cases automatically generated by Large Language Models (LLMs) is a critical yet challenging task. Existing benchmarks suffer from high computational costs, score inflation, and a bias towards trivial bugs over rare, critical faults. In this work, we ask two fundamental questions: (1) What is the minimal set of wrong codes sufficient to represent the entire error space? and (2) What is the minimal set of test cases needed to distinguish them? We introduce a framework that formalizes benchmark construction as finding an optimal diagnostic basis in a binary code-test matrix. The rank of this matrix specifies the minimal number of independent error patterns (wrong codes) and provides a tight upper bound on the number of test cases required for complete fault coverage. Our objective is to identify a basis of size equal to the matrix rank that maximizes internal diversity. To tackle this NP-hard problem, we propose WrongSelect, an efficient approximation algorithm to select maximally diverse wrong codes. Applying this framework to millions of competitive programming submissions, we construct TC-Bench, a compact, diverse, and inflation-resistant benchmark. Extensive experiments show that even the most advanced test case generation methods achieve only ~60% exclusion rates on TC-Bench, exposing a significant gap in their diagnostic power. Our dataset is available at: https://huggingface.co/datasets/Luoberta/TC-Bench and our code is at: https://github.com/Luowaterbi/TC-Bench.
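The paper's WrongSelect algorithm is not detailed in this card, but the diversity-maximization objective it approximates can be sketched with a generic greedy farthest-point heuristic over Hamming distances between failure signatures. This is a stand-in illustration, not the paper's actual algorithm:

```python
def hamming(a, b):
    """Hamming distance between two equal-length binary failure signatures."""
    return sum(x != y for x, y in zip(a, b))

def greedy_diverse_select(rows, k):
    """Greedily pick k rows, maximizing the minimum pairwise Hamming
    distance among the selected failure signatures.

    A generic farthest-point heuristic, used here only to illustrate
    the NP-hard diversity objective; not the paper's WrongSelect.
    """
    n = len(rows)
    # Seed with the pair of rows at maximum distance.
    best = max(((i, j) for i in range(n) for j in range(i + 1, n)),
               key=lambda p: hamming(rows[p[0]], rows[p[1]]))
    chosen = list(best)
    while len(chosen) < k:
        # Add the row farthest (in min-distance) from everything chosen.
        candidate = max((i for i in range(n) if i not in chosen),
                        key=lambda i: min(hamming(rows[i], rows[j])
                                          for j in chosen))
        chosen.append(candidate)
    return sorted(chosen)

# Toy failure signatures for 4 wrong codes over 4 tests.
sigs = [
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
]
print(greedy_diverse_select(sigs, 2))  # [1, 2]
```

Greedy max-min selection like this gives no optimality guarantee, which is consistent with the abstract's framing of exact basis selection as NP-hard.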
Problem

Research questions and friction points this paper is trying to address.

Determining the minimal set of wrong codes that represents the entire error space
Identifying the minimal set of test cases needed for complete fault coverage
Selecting maximally diverse wrong codes efficiently despite the NP-hardness of the underlying problem
Innovation

Methods, ideas, or system contributions that make the work stand out.

A binary code-test matrix framework models which test cases expose which wrong codes
The WrongSelect algorithm efficiently approximates the NP-hard selection of maximally diverse wrong codes
TC-Bench provides a compact, diverse, and score-inflation-resistant benchmark
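The headline result (~60% exclusion rate for state-of-the-art methods) can be made concrete with a small sketch. The exact metric definition is not given in this card; the assumption below is that a wrong code counts as excluded if at least one generated test fails it:

```python
def exclusion_rate(fail_matrix):
    """Fraction of wrong codes excluded by at least one generated test.

    fail_matrix[i][j] == 1 means generated test j fails wrong code i.
    (Assumed reading of the exclusion-rate metric; see the paper for
    the exact definition.)
    """
    excluded = sum(1 for row in fail_matrix if any(row))
    return excluded / len(fail_matrix)

# Toy run: 3 wrong codes evaluated against 2 generated tests.
fails = [
    [1, 0],  # excluded by test 0
    [0, 1],  # excluded by test 1
    [0, 0],  # survives: no generated test exposes this bug
]
print(exclusion_rate(fails))  # 2/3 of wrong codes excluded
```

On a diagnostic benchmark like TC-Bench, surviving rows correspond to error patterns the generated suite cannot distinguish from correct code.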
๐Ÿ”Ž Similar Papers
No similar papers found.