🤖 AI Summary
Existing test-driven code generation approaches are largely confined to the function level and struggle to handle the complexity of class-level code, where multiple methods interact through shared state and invocation dependencies. This work proposes an iterative test-driven framework that addresses this challenge by analyzing intra-class method dependencies to determine a synthesis order, then progressively generating complete class implementations through a combination of public test execution, reflective execution feedback, and bounded repair iterations. The study presents the first effective extension of test-driven program synthesis to the class level and introduces ClassEval-TDD, a standardized benchmark for evaluation. Experiments across eight large language models demonstrate that the proposed method improves class-level pass rates by 12–26 percentage points, achieving up to 71% full correctness, with only a few repair iterations required on average.
📝 Abstract
Test-driven development (TDD) has been adopted to improve Large Language Model (LLM)-based code generation by using tests as executable specifications. However, existing TDD-style code generation studies are largely limited to function-level tasks, leaving class-level synthesis, where multiple methods interact through shared state and call dependencies, underexplored. In this paper, we scale test-driven code generation from functions to classes via an iterative TDD framework. Our approach first analyzes intra-class method dependencies to derive a feasible generation schedule, and then incrementally implements each method under method-level public tests with reflection-style execution feedback and bounded repair iterations. To support test-driven generation and rigorous class-level evaluation, we construct ClassEval-TDD, a cleaned and standardized variant of ClassEval with consistent specifications, deterministic test environments, and complete method-level public tests. We conduct an empirical study across eight LLMs and compare against the strongest direct-generation baseline (the best of the holistic, incremental, and compositional strategies). Our class-level TDD framework consistently improves class-level correctness by 12 to 26 absolute percentage points and achieves up to 71% fully correct classes, while requiring only a small number of repair iterations on average. These results demonstrate that test-driven generation can effectively scale beyond isolated functions and substantially improve the reliability of class-level code generation. All code and data are available at https://anonymous.4open.science/r/ClassEval-TDD-C4C9/
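The two core steps in the abstract, scheduling methods by their intra-class dependencies and running a bounded generate-and-repair loop against public tests, can be illustrated with a minimal Python sketch. This is not the paper's implementation: the dependency ordering uses a standard topological sort, while `generate` and `run_public_tests` are hypothetical stand-ins for the LLM call and the method-level test harness.

```python
# Hedged sketch of a class-level TDD loop: schedule methods so that
# callees are synthesized before callers, then generate each method
# under public tests with bounded, feedback-driven repair.
from graphlib import TopologicalSorter

def synthesis_order(calls: dict[str, set[str]]) -> list[str]:
    """Order methods so each method comes after the intra-class
    methods it invokes. `calls` maps method -> set of callees."""
    return list(TopologicalSorter(calls).static_order())

def generate_with_repair(method, run_public_tests, generate, max_repairs=3):
    """Bounded generate-and-repair loop driven by public-test feedback.
    `generate(method, feedback)` and `run_public_tests(method, code)`
    are assumed stubs; the latter returns (passed, failure_feedback)."""
    code = generate(method, feedback=None)
    for _ in range(max_repairs):
        ok, feedback = run_public_tests(method, code)
        if ok:
            return code, True
        code = generate(method, feedback=feedback)  # reflective repair
    ok, _ = run_public_tests(method, code)
    return code, ok

# Example: `push` and `pop` both call a shared `_check` helper,
# so `_check` is scheduled before either of them.
order = synthesis_order({"push": {"_check"}, "pop": {"_check"}, "_check": set()})
```

The scheduling step mirrors the paper's idea that a method's dependencies should already be implemented when it is generated; cycles (mutual recursion) would need a strongly-connected-component grouping, which this sketch omits.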