ConvCodeWorld: Benchmarking Conversational Code Generation in Reproducible Feedback Environments

📅 2025-02-27
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing code generation benchmarks fail to model the diverse multi-turn feedback encountered in conversational programming, such as compilation errors, execution outcomes, and natural-language critiques, limiting rigorous evaluation of LLMs in these settings. Method: We introduce ConvCodeWorld (dynamic) and ConvCodeBench (static), the first reproducible multi-turn feedback evaluation benchmarks, establishing a feedback-driven evaluation paradigm. Leveraging GPT-4o, we generate structured natural-language feedback, integrate compiler-based error simulation, and deploy a coverage-aware execution engine to emulate realistic developer interactions; consistency between the two benchmarks is validated via Spearman rank correlation. Contribution/Results: Experiments reveal that feedback type and quality critically affect model adaptability: weaker models given multiple feedback rounds can surpass the single-turn performance of stronger ones, yet training on a specific feedback combination can bottleneck generalization to unseen combinations. Moreover, a trade-off exists between Mean Reciprocal Rank (MRR) and Recall, highlighting inherent limitations in current feedback-integration strategies.
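The static/dynamic consistency check reduces to Spearman's rank correlation between model rankings produced by the two benchmarks. A minimal pure-Python sketch (the model scores below are hypothetical, not the paper's; the paper reports correlations of 0.82 to 0.99):

```python
def rank(values):
    # Average 1-based ranks; tied values receive the mean of their positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    # Spearman's rho = Pearson correlation computed on the ranks.
    rx, ry = rank(xs), rank(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-model scores on the dynamic vs. static benchmark.
dynamic = [0.71, 0.64, 0.58, 0.52, 0.40]
static_ = [0.69, 0.60, 0.61, 0.50, 0.38]
print(round(spearman_rho(dynamic, static_), 2))  # → 0.9
```

A high rho means the cheap static benchmark ranks models nearly the same way as the costly dynamic one, which is the property that justifies using pre-generated feedback logs.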

📝 Abstract
Large language models (LLMs) have proven invaluable for code generation, particularly in interactive settings. However, existing code generation benchmarks fail to capture the diverse feedback encountered in multi-turn interactions, limiting our ability to evaluate LLMs in these contexts. To address this gap, we present a set of novel benchmarks that explicitly model the quality of feedback provided to code generation LLMs. Our contributions are threefold: First, we introduce CONVCODEWORLD, a novel and reproducible environment for benchmarking interactive code generation. CONVCODEWORLD simulates 9 distinct interactive code generation scenarios while systematically combining three types of feedback: (a) compilation feedback; (b) execution feedback with varying test coverage; (c) verbal feedback generated by GPT-4o with different levels of expertise. Second, we introduce CONVCODEBENCH, a fast, static version of the benchmark that uses pre-generated feedback logs, eliminating the need for costly dynamic verbal feedback generation while maintaining strong Spearman's rank correlations (0.82 to 0.99) with CONVCODEWORLD. Third, extensive evaluations of both closed-source and open-source LLMs, including R1-Distill, on CONVCODEWORLD reveal key insights: (a) LLM performance varies significantly based on the feedback provided; (b) Weaker LLMs, with sufficient feedback, can outperform single-turn results of state-of-the-art LLMs without feedback; (c) Training on a specific feedback combination can limit an LLM's ability to utilize unseen combinations; (d) LLMs that solve problems in fewer turns (high MRR) may not solve as many problems overall (high Recall), and vice versa. All implementations and benchmarks will be made publicly available at https://huggingface.co/spaces/ConvCodeWorld/ConvCodeWorld
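The MRR/Recall trade-off in insight (d) can be made concrete with a small sketch. Given, for each problem, the 1-based turn at which a model first produced a passing solution (or None if it never did within the turn budget), MRR rewards solving early while Recall only counts whether a problem is solved at all. The logs below are hypothetical, not the paper's data:

```python
def mrr_and_recall(solve_turns, max_turns):
    """solve_turns[i]: 1-based turn at which problem i was first solved,
    or None if unsolved within max_turns."""
    rr = [1.0 / t if t is not None and t <= max_turns else 0.0
          for t in solve_turns]
    solved = sum(1 for t in solve_turns if t is not None and t <= max_turns)
    return sum(rr) / len(solve_turns), solved / len(solve_turns)

# Hypothetical logs: model A solves quickly but gives up on hard problems;
# model B needs more turns but eventually solves more problems.
model_a = [1, 1, 2, None, None]
model_b = [2, 3, 3, 4, None]
print(mrr_and_recall(model_a, max_turns=5))  # higher MRR, lower Recall
print(mrr_and_recall(model_b, max_turns=5))  # lower MRR, higher Recall
```

Here model A scores MRR 0.5 with Recall 0.6, while model B scores a lower MRR (about 0.28) with a higher Recall of 0.8, so neither metric alone captures multi-turn capability.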
Problem

Research questions and friction points this paper is trying to address.

Benchmarking conversational code generation in interactive settings
Modeling diverse feedback for evaluating LLMs
Introducing reproducible environments for code generation scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces CONVCODEWORLD for interactive benchmarking
Develops CONVCODEBENCH with pre-generated feedback logs
Evaluates LLMs with diverse feedback combinations