ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation

📅 2026-04-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing evaluations struggle to assess large language models on class-level compositional code generation, and scalable, contamination-resistant benchmarks for this setting are lacking. This work introduces ClassEval-Pro, a benchmark of 300 tasks across 11 domains, constructed via an automated three-stage pipeline that combines complexity enhancement, cross-domain class composition, and integration of real-world GitHub code contributed after January 2025. Every task is validated by an LLM Judge Ensemble and must pass test suites with over 90% line coverage. Across five frontier LLMs and five generation strategies, the best model attains only 45.6% class-level Pass@1, and the gap between the strongest and weakest models is 17.7 percentage points. Structured strategies such as bottom-up improve weaker models by up to 9.4 percentage points, whereas compositional generation collapses to as low as 1.3%. Error analysis over 500 manually annotated failures identifies logic errors (56.2%) and dependency errors (38.0%) as the dominant failure types, pointing to cross-method coordination as the core bottleneck.

📝 Abstract

LLMs have achieved strong results on both function-level code synthesis and repository-level code modification, yet a capability that falls between these two extremes -- compositional code creation, i.e., building a complete, internally structured class from a specification -- remains underserved. Current evaluations are either confined to isolated functions or rely on manually curated class-level tasks that are expensive to scale and increasingly susceptible to data contamination. We introduce ClassEval-Pro, a benchmark of 300 class-level tasks spanning 11 domains, constructed through an automated three-stage pipeline that combines complexity enhancement, cross-domain class composition, and integration of real-world GitHub code contributed after January 2025. Every task is validated by an LLM Judge Ensemble and must pass test suites with over 90% line coverage. We evaluate five frontier LLMs under five generation strategies. The best model achieves only 45.6% class-level Pass@1, with a 17.7-point gap between the strongest and weakest models, confirming the benchmark's discriminative power. Strategy choice strongly interacts with model capability: structured approaches such as bottom-up improve weaker models by up to 9.4 percentage points, while compositional generation collapses to as low as 1.3%. Error analysis over 500 manually annotated failures reveals that logic errors (56.2%) and dependency errors (38.0%) dominate, identifying cross-method coordination as the core bottleneck.
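The abstract reports class-level Pass@1 without defining it. As a rough illustration only, the sketch below assumes the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), and assumes that a generated class counts as correct only when every test in its suite passes. The function names and the input layout are hypothetical and are not taken from the paper.

```python
from math import comb
from typing import List


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n samples, c of which are correct."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples are incorrect, any draw of k contains a correct one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def class_level_pass_at_k(results: List[List[List[bool]]], k: int = 1) -> float:
    """Average pass@k over tasks.

    results[t][s] holds the per-test outcomes for sample s of task t.
    Assumption: a sample is correct only if its suite is non-empty and
    every test passes (the whole class must work, not individual methods).
    """
    scores = []
    for samples in results:
        n = len(samples)
        c = sum(1 for tests in samples if tests and all(tests))
        scores.append(pass_at_k(n, c, k))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Two hypothetical tasks, two samples each.
    demo = [
        [[True, True], [True, False]],  # one of two samples passes everything
        [[False], [False]],             # no sample passes
    ]
    print(class_level_pass_at_k(demo, k=1))
```

Under these assumptions, a class with a single failing method test scores zero for that sample, which is one reason class-level scores can sit well below method-level ones.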

Problem

Research questions and friction points this paper is trying to address.

class-level code generation

compositional code creation

cross-domain benchmark

LLM evaluation

code synthesis

Innovation

Methods, ideas, or system contributions that make the work stand out.

class-level code generation

cross-domain benchmark

compositional code creation