RECODE-H: A Benchmark for Research Code Development with Interactive Human Feedback

📅 2025-10-07
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing large language models (LLMs) exhibit limited correctness and executability in scientific code generation, while mainstream benchmarks overlook the iterative, feedback-driven nature intrinsic to scientific software development. Method: We introduce RECODE-H, a 102-task benchmark tailored to scientific research scenarios, featuring a five-level feedback hierarchy and a multi-round interactive evaluation protocol that systematically models the dynamics of human-AI collaboration in scientific coding. The approach integrates structured instructions, unit tests, and the ReCodeAgent iterative framework, leveraging LLM-simulated human feedback to enable closed-loop refinement. Contribution/Results: Experiments demonstrate that rich, structured feedback substantially improves the performance of state-of-the-art LLMs on complex scientific code generation tasks, while also exposing persistent bottlenecks in high-complexity settings. This work establishes a new paradigm and empirical foundation for adaptive, feedback-driven programming agents in scientific domains.
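A minimal Python sketch of this closed loop, under stated assumptions: a pytest-style unit-test harness, a hypothetical `agent` with `generate`/`revise` methods, a hypothetical `feedback_llm.give_feedback` simulator, and an assumed `MAX_ROUNDS` cap. None of these names come from the paper.

```python
# Illustrative sketch of the multi-round evaluation loop described above.
# All interfaces (agent, feedback_llm, task schema) are hypothetical.
import subprocess
import tempfile
from pathlib import Path

MAX_ROUNDS = 5  # assumed interaction cap; the paper's actual limit may differ


def run_unit_tests(code: str, tests: str) -> tuple[bool, str]:
    """Write the candidate solution next to the task's unit tests and run pytest."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "solution.py").write_text(code)
        Path(tmp, "test_solution.py").write_text(tests)
        result = subprocess.run(
            ["python", "-m", "pytest", tmp, "-q"],
            capture_output=True, text=True,
        )
        return result.returncode == 0, result.stdout + result.stderr


def evaluate_task(agent, feedback_llm, task: dict) -> bool:
    """Closed loop: generate -> test -> simulated researcher feedback -> revise."""
    code = agent.generate(task["instruction"])
    for _ in range(MAX_ROUNDS):
        passed, log = run_unit_tests(code, task["tests"])
        if passed:
            return True
        # An LLM stands in for the human researcher, turning the failure log
        # into feedback at the configured richness level (1 = terse, 5 = rich).
        feedback = feedback_llm.give_feedback(task, code, log, level=task["level"])
        code = agent.revise(code, feedback)
    return False
```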

πŸ“ Abstract
Large language models (LLMs) show promise in supporting scientific research implementation, yet their ability to generate correct and executable code remains limited. Existing works largely adopt one-shot settings, ignoring the iterative, feedback-driven nature of realistic scientific research development workflows. To address this gap, we present RECODE-H, a benchmark of 102 tasks from research papers and repositories that evaluates LLM agents through multi-turn interactions with LLM-simulated human feedback. It includes structured instructions, unit tests, and a five-level feedback hierarchy to reflect realistic researcher-agent collaboration. We further present ReCodeAgent, a framework that integrates feedback into iterative code generation. Experiments with leading LLMs, including GPT-5, Claude-Sonnet-4, DeepSeek-V3.1, and Gemini 2.5, show substantial performance gains with richer feedback, while also highlighting ongoing challenges in the generation of complex research code. RECODE-H establishes a foundation for developing adaptive, feedback-driven LLM agents in scientific research implementation.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' ability to generate correct, executable research code
Addressing the lack of iterative feedback in scientific code development
Benchmarking multi-turn interactions with simulated human feedback
Innovation

Methods, ideas, or system contributions that make the work stand out.

RECODE-H benchmark evaluates LLM agents with human feedback
ReCodeAgent framework integrates feedback into iterative code generation
Five-level feedback hierarchy simulates realistic researcher-agent collaboration (see the sketch below)
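The paper defines five feedback levels, but their exact definitions are not reproduced on this page. The sketch below illustrates one plausible cumulative hierarchy, from a bare pass/fail signal up to explicit corrective instructions; the level names and the `render_feedback` helper are hypothetical illustrations, not the paper's terminology.

```python
# Hypothetical illustration of a five-level feedback hierarchy; the actual
# level definitions belong to the RECODE-H paper and may differ.
from enum import IntEnum


class FeedbackLevel(IntEnum):
    BINARY = 1       # pass/fail signal only
    ERROR_LOG = 2    # plus raw unit-test output and tracebacks
    LOCALIZED = 3    # plus which function or block is at fault
    DIAGNOSTIC = 4   # plus a natural-language explanation of the bug
    CORRECTIVE = 5   # plus concrete instructions for the fix


def render_feedback(level: FeedbackLevel, components: dict[str, str]) -> str:
    """Compose the simulated researcher's message: each level reveals one
    more component than the level below it."""
    order = ["binary", "error_log", "localized", "diagnostic", "corrective"]
    return "\n".join(components[key] for key in order[:level])


# Example: level-3 feedback includes the verdict, the log, and the location.
message = render_feedback(
    FeedbackLevel.LOCALIZED,
    {
        "binary": "Tests failed.",
        "error_log": "AssertionError in test_energy_conservation",
        "localized": "The bug is in integrate_step(), not in the setup code.",
    },
)
```

Under a cumulative reading like this, each level strictly adds information to the one below it, which makes the reported "richer feedback helps richer performance" comparison monotone by construction.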
Chunyu Miao (University of Illinois at Chicago): LLM, code generation
Henry Peng Zou (University of Illinois Chicago): Agents, Large Language Models, Multimodal Learning, Natural Language Processing
Yangning Li (Tsinghua University)
Yankai Chen (Postdoctoral Associate, Cornell University): Information Retrieval, Knowledge Mining, Large Language Models, Agentic AI
Yibo Wang (University of Illinois Chicago)
Fangxin Wang (University of Illinois Chicago)
Yifan Li (The Chinese University of Hong Kong)
Wooseong Yang (University of Illinois Chicago)
Bowei He (City University of Hong Kong, MBZUAI): Data Mining, Language Model, GenAI4Science, Agentic AI
Xinni Zhang (The Chinese University of Hong Kong)
Dianzhi Yu (The Chinese University of Hong Kong)
Hanchen Yang (Georgia Institute of Technology): Computer Architecture, Machine Learning
Hoang H Nguyen (University of Illinois Chicago)
Yue Zhou (University of Illinois Chicago)
Jie Yang (University of Illinois Chicago)
Jizhou Guo (Shanghai Jiao Tong University): Large Language Models, Foundation Models, Natural Language Processing
Wenzhe Fan (University of Illinois Chicago): MARL, LLM agents
Chin-Yuan Yeh (National Taiwan University)
Panpan Meng (Xi'an Jiaotong University)
Liancheng Fang (University of Illinois Chicago): Generative model
Jinhu Qi (PhD candidate in CUHK CSE): Agentic AI, LLMs, Reasoning
Wei-Chieh Huang (University of Illinois Chicago): Natural language processing
Zhengyao Gu (University of Illinois Chicago): NLP, GNN
Yuwei Han (University of Illinois Chicago)
Langzhou He (University of Illinois Chicago): Machine Learning, Large Language Model
Yuyao Yang (University of Illinois Chicago)
Xue Liu (McGill University, MBZUAI)
Irwin King (The Chinese University of Hong Kong): social computing, machine learning, AI, graph neural networks, NLP
Philip S. Yu (Professor of Computer Science, University of Illinois at Chicago): Data mining, Database, Privacy