🤖 AI Summary
Existing test-driven code generation approaches are largely confined to the function level and struggle to handle the complexity of class-level code, where multiple methods interact through shared state and invocation dependencies. This work proposes an iterative test-driven framework that addresses this challenge by analyzing intra-class method dependencies to determine a synthesis order, then progressively generating complete class implementations through a combination of public test execution, reflective execution feedback, and bounded repair iterations. The study presents the first effective extension of test-driven program synthesis to the class level and introduces ClassEval-TDD, a standardized benchmark for evaluation. Experiments across eight large language models demonstrate that the proposed method improves class-level pass rates by 12–26 percentage points, achieving up to 71% full correctness, with only a few repair iterations required on average.
📝 Abstract
Test-driven development (TDD) has been adopted to improve Large Language Model (LLM)-based code generation by using tests as executable specifications. However, existing TDD-style code generation studies are largely limited to function-level tasks, leaving class-level synthesis, where multiple methods interact through shared state and call dependencies, underexplored. In this paper, we scale test-driven code generation from functions to classes via an iterative TDD framework. Our approach first analyzes intra-class method dependencies to derive a feasible generation schedule, and then incrementally implements each method under method-level public tests with reflection-style execution feedback and bounded repair iterations. To support test-driven generation and rigorous class-level evaluation, we construct ClassEval-TDD, a cleaned and standardized variant of ClassEval with consistent specifications, deterministic test environments, and complete method-level public tests. We conduct an empirical study across eight LLMs and compare against the strongest direct-generation baseline (the best of the holistic, incremental, and compositional strategies). Our class-level TDD framework consistently improves class-level correctness by 12 to 26 absolute percentage points and achieves up to 71% fully correct classes, while requiring only a small number of repair iterations on average. These results demonstrate that test-driven generation can effectively scale beyond isolated functions and substantially improve the reliability of class-level code generation. All code and data are available at https://anonymous.4open.science/r/ClassEval-TDD-C4C9/
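The two core steps in the abstract, scheduling methods by their intra-class dependencies and running a bounded generate-and-repair loop against public tests, can be illustrated with a minimal Python sketch. This is not the paper's implementation: the dependency ordering uses a standard topological sort, while `generate` and `run_public_tests` are hypothetical stand-ins for the LLM call and the method-level test harness.

```python
# Hedged sketch of a class-level TDD loop: schedule methods so that
# callees are synthesized before callers, then generate each method
# under public tests with bounded, feedback-driven repair.
from graphlib import TopologicalSorter

def synthesis_order(calls: dict[str, set[str]]) -> list[str]:
    """Order methods so each method comes after the intra-class
    methods it invokes. `calls` maps method -> set of callees."""
    return list(TopologicalSorter(calls).static_order())

def generate_with_repair(method, run_public_tests, generate, max_repairs=3):
    """Bounded generate-and-repair loop driven by public-test feedback.
    `generate(method, feedback)` and `run_public_tests(method, code)`
    are assumed stubs; the latter returns (passed, failure_feedback)."""
    code = generate(method, feedback=None)
    for _ in range(max_repairs):
        ok, feedback = run_public_tests(method, code)
        if ok:
            return code, True
        code = generate(method, feedback=feedback)  # reflective repair
    ok, _ = run_public_tests(method, code)
    return code, ok

# Example: `push` and `pop` both call a shared `_check` helper,
# so `_check` is scheduled before either of them.
order = synthesis_order({"push": {"_check"}, "pop": {"_check"}, "_check": set()})
```

The scheduling step mirrors the paper's idea that a method's dependencies should already be implemented when it is generated; cycles (mutual recursion) would need a strongly-connected-component grouping, which this sketch omits.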