RECODE-H: A Benchmark for Research Code Development with Interactive Human Feedback

📅 2025-10-07
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing large language models (LLMs) exhibit limited correctness and executability in scientific code generation, while mainstream benchmarks overlook the iterative, feedback-driven nature intrinsic to scientific software development. Method: We introduce RECODE-H, a 102-task benchmark tailored to scientific research scenarios, featuring a five-level feedback hierarchy and a multi-round interactive evaluation protocol that systematically models the dynamics of human-AI collaboration in scientific coding. The approach integrates structured instructions, unit tests, and the ReCodeAgent iterative framework, leveraging LLM-simulated human feedback to enable closed-loop refinement. Contribution/Results: Experiments demonstrate that rich, structured feedback substantially improves the performance of state-of-the-art LLMs on complex scientific code generation tasks, while also exposing persistent bottlenecks in high-complexity settings. This work establishes a new paradigm and empirical foundation for adaptive, feedback-driven programming agents in scientific domains.
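A minimal Python sketch of this closed loop, under stated assumptions: a pytest-style unit-test harness, a hypothetical `agent` with `generate`/`revise` methods, a hypothetical `feedback_llm.give_feedback` simulator, and an assumed `MAX_ROUNDS` cap. None of these names come from the paper.

```python
# Illustrative sketch of the multi-round evaluation loop described above.
# All interfaces (agent, feedback_llm, task schema) are hypothetical.
import subprocess
import tempfile
from pathlib import Path

MAX_ROUNDS = 5  # assumed interaction cap; the paper's actual limit may differ


def run_unit_tests(code: str, tests: str) -> tuple[bool, str]:
    """Write the candidate solution next to the task's unit tests and run pytest."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "solution.py").write_text(code)
        Path(tmp, "test_solution.py").write_text(tests)
        result = subprocess.run(
            ["python", "-m", "pytest", tmp, "-q"],
            capture_output=True, text=True,
        )
        return result.returncode == 0, result.stdout + result.stderr


def evaluate_task(agent, feedback_llm, task: dict) -> bool:
    """Closed loop: generate -> test -> simulated researcher feedback -> revise."""
    code = agent.generate(task["instruction"])
    for _ in range(MAX_ROUNDS):
        passed, log = run_unit_tests(code, task["tests"])
        if passed:
            return True
        # An LLM stands in for the human researcher, turning the failure log
        # into feedback at the configured richness level (1 = terse, 5 = rich).
        feedback = feedback_llm.give_feedback(task, code, log, level=task["level"])
        code = agent.revise(code, feedback)
    return False
```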

πŸ“ Abstract
Large language models (LLMs) show promise in supporting scientific research implementation, yet their ability to generate correct and executable code remains limited. Existing works largely adopt one-shot settings, ignoring the iterative, feedback-driven nature of realistic scientific research development workflows. To address this gap, we present RECODE-H, a benchmark of 102 tasks from research papers and repositories that evaluates LLM agents through multi-turn interactions with LLM-simulated human feedback. It includes structured instructions, unit tests, and a five-level feedback hierarchy to reflect realistic researcher-agent collaboration. We further present ReCodeAgent, a framework that integrates feedback into iterative code generation. Experiments with leading LLMs, including GPT-5, Claude-Sonnet-4, DeepSeek-V3.1, and Gemini 2.5, show substantial performance gains with richer feedback, while also highlighting ongoing challenges in the generation of complex research code. RECODE-H establishes a foundation for developing adaptive, feedback-driven LLM agents in scientific research implementation.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' ability to generate correct, executable research code
Addressing the lack of iterative feedback in scientific code development
Benchmarking multi-turn interactions with simulated human feedback
Innovation

Methods, ideas, or system contributions that make the work stand out.

RECODE-H benchmark evaluates LLM agents with human feedback
ReCodeAgent framework integrates feedback into iterative code generation
Five-level feedback hierarchy simulates realistic researcher-agent collaboration (see the sketch below)
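The paper defines five feedback levels, but their exact definitions are not reproduced on this page. The sketch below illustrates one plausible cumulative hierarchy, from a bare pass/fail signal up to explicit corrective instructions; the level names and the `render_feedback` helper are hypothetical illustrations, not the paper's terminology.

```python
# Hypothetical illustration of a five-level feedback hierarchy; the actual
# level definitions belong to the RECODE-H paper and may differ.
from enum import IntEnum


class FeedbackLevel(IntEnum):
    BINARY = 1       # pass/fail signal only
    ERROR_LOG = 2    # plus raw unit-test output and tracebacks
    LOCALIZED = 3    # plus which function or block is at fault
    DIAGNOSTIC = 4   # plus a natural-language explanation of the bug
    CORRECTIVE = 5   # plus concrete instructions for the fix


def render_feedback(level: FeedbackLevel, components: dict[str, str]) -> str:
    """Compose the simulated researcher's message: each level reveals one
    more component than the level below it."""
    order = ["binary", "error_log", "localized", "diagnostic", "corrective"]
    return "\n".join(components[key] for key in order[:level])


# Example: level-3 feedback includes the verdict, the log, and the location.
message = render_feedback(
    FeedbackLevel.LOCALIZED,
    {
        "binary": "Tests failed.",
        "error_log": "AssertionError in test_energy_conservation",
        "localized": "The bug is in integrate_step(), not in the setup code.",
    },
)
```

Under a cumulative reading like this, each level strictly adds information to the one below it, which makes the reported "richer feedback helps richer performance" comparison monotone by construction.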
Chunyu Miao (University of Illinois at Chicago): LLM, code generation
Henry Peng Zou (University of Illinois Chicago): Agents, Large Language Models, Multimodal Learning, Natural Language Processing
Yangning Li (Tsinghua University)
Yankai Chen (Postdoctoral Associate, Cornell University): Information Retrieval, Knowledge Mining, Large Language Models, Agentic AI
Yibo Wang (University of Illinois Chicago)
Fangxin Wang (University of Illinois Chicago)
Yifan Li (The Chinese University of Hong Kong)
Wooseong Yang (University of Illinois Chicago)
Bowei He (City University of Hong Kong, MBZUAI): Data Mining, Language Model, GenAI4Science, Agentic AI
Xinni Zhang (The Chinese University of Hong Kong)
Dianzhi Yu (The Chinese University of Hong Kong)
Hanchen Yang (Georgia Institute of Technology): Computer Architecture, Machine Learning
Hoang H Nguyen (University of Illinois Chicago)
Yue Zhou (University of Illinois Chicago)
Jie Yang (University of Illinois Chicago)
Jizhou Guo (Shanghai Jiao Tong University): Large Language Models, Foundation Models, Natural Language Processing
Wenzhe Fan (University of Illinois Chicago): MARL, LLM agents
Chin-Yuan Yeh (National Taiwan University)
Panpan Meng (Xi'an Jiaotong University)
Liancheng Fang (University of Illinois Chicago): Generative model
Jinhu Qi (PhD candidate in CUHK CSE): Agentic AI, LLMs, Reasoning
Wei-Chieh Huang (University of Illinois Chicago): Natural language processing
Zhengyao Gu (University of Illinois Chicago): NLP, GNN
Yuwei Han (University of Illinois Chicago)
Langzhou He (University of Illinois Chicago): Machine Learning, Large Language Model
Yuyao Yang (University of Illinois Chicago)
Xue Liu (McGill University, MBZUAI)
Irwin King (The Chinese University of Hong Kong): social computing, machine learning, AI, graph neural networks, NLP
Philip S. Yu (Professor of Computer Science, University of Illinois at Chicago): Data mining, Database, Privacy