AI Summary
Existing large language models (LLMs) exhibit limited correctness and executability in scientific code generation, while mainstream benchmarks overlook the iterative, feedback-driven nature of scientific software development. Method: We introduce RECODE-H, a 102-task benchmark tailored to scientific research scenarios, featuring a five-level feedback hierarchy and a multi-round interactive evaluation protocol that models the dynamics of human-AI collaboration in scientific coding. The approach integrates structured instructions, unit tests, and the iterative ReCodeAgent framework, leveraging LLM-simulated human feedback to enable closed-loop refinement. Contribution/Results: Experiments demonstrate that rich, structured feedback substantially improves the performance of state-of-the-art LLMs on complex scientific code-generation tasks, while also exposing persistent bottlenecks in high-complexity settings. This work establishes an empirical foundation for adaptive, feedback-driven programming agents in scientific domains.
Abstract
Large language models (LLMs) show promise in supporting scientific research implementation, yet their ability to generate correct and executable code remains limited. Existing work largely adopts one-shot settings, ignoring the iterative, feedback-driven nature of realistic scientific software development. To address this gap, we present RECODE-H, a benchmark of 102 tasks drawn from research papers and repositories that evaluates LLM agents through multi-turn interactions with LLM-simulated human feedback. It includes structured instructions, unit tests, and a five-level feedback hierarchy to reflect realistic researcher-agent collaboration. We further present ReCodeAgent, a framework that integrates feedback into iterative code generation. Experiments with leading LLMs, including GPT-5, Claude-Sonnet-4, DeepSeek-V3.1, and Gemini 2.5, show substantial performance gains with richer feedback, while also highlighting ongoing challenges in generating complex research code. RECODE-H establishes a foundation for developing adaptive, feedback-driven LLM agents for scientific research implementation.
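The multi-round interactive evaluation described above can be sketched as a simple loop: the agent drafts code, unit tests are run, and a simulated reviewer returns feedback until the tests pass or a round budget is exhausted. This is a minimal illustration only; the function names (`generate_code`, `run_unit_tests`, `simulate_feedback`) and the stubbed behavior are hypothetical stand-ins, not the actual RECODE-H or ReCodeAgent API.

```python
# Hypothetical sketch of a multi-round, feedback-driven evaluation loop.
# The stub implementations below only make the loop runnable; in the real
# setting these would be LLM calls and a task-specific test harness.

def generate_code(instruction, feedback_history):
    # Stand-in for an LLM call: each revision incorporates prior feedback.
    return f"# attempt {len(feedback_history) + 1} for: {instruction}"

def run_unit_tests(code):
    # Stand-in test harness: pretend the third attempt finally passes.
    passed = "attempt 3" in code
    failures = [] if passed else ["test_output_shape failed"]
    return passed, failures

def simulate_feedback(code, failures, level):
    # Stand-in for LLM-simulated human feedback; higher levels would
    # correspond to richer, more directive hints in the hierarchy.
    return f"level-{level} feedback on failure: {failures[0]}"

def interactive_eval(instruction, feedback_level, max_rounds=5):
    """Run the closed loop: generate -> test -> feedback -> regenerate."""
    history = []
    for round_idx in range(max_rounds):
        code = generate_code(instruction, history)
        passed, failures = run_unit_tests(code)
        if passed:
            return {"solved": True, "rounds": round_idx + 1}
        history.append(simulate_feedback(code, failures, feedback_level))
    return {"solved": False, "rounds": max_rounds}

result = interactive_eval("reproduce Eq. (3) from the paper", feedback_level=3)
```

Under this toy harness, the task resolves on the third round, mirroring how richer feedback is reported to shorten the path to a passing solution.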