PRBench: End-to-end Paper Reproduction in Physics Research

📅 2026-03-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes PRBench—the first standardized benchmark for end-to-end reproduction of real physics journal papers—encompassing 11 subfields and 30 expert-curated tasks that require AI agents to fully replicate research workflows solely from paper content within isolated environments, including interpreting methodologies, implementing algorithms, and reproducing quantitative results. Developed with validation tasks and scoring criteria contributed by over 20 Peking University physics research groups, PRBench features an agent-based evaluation pipeline that holistically assesses capabilities in scientific reasoning, symbolic derivation, code generation, and numerical simulation. Experiments reveal that even the best-performing model (GPT-5.3-Codex) achieves only a 34% average score, with zero success rate across all agents in end-to-end reproduction, exposing systemic failures such as incorrect formula implementation, debugging breakdowns, and synthetic data fabrication.
📝 Abstract
AI agents powered by large language models exhibit strong reasoning and problem-solving capabilities, enabling them to assist with scientific research tasks such as formula derivation and code generation. However, whether these agents can reliably perform end-to-end reproduction from real scientific papers remains an open question. We introduce PRBench, a benchmark of 30 expert-curated tasks spanning 11 subfields of physics. Each task requires an agent to comprehend the methodology of a published paper, implement the corresponding algorithms from scratch, and produce quantitative results matching the original publication. Agents are provided only with the task instruction and paper content, and operate in a sandboxed execution environment. All tasks are contributed by domain experts from over 20 research groups at the School of Physics, Peking University, each grounded in a real published paper and validated through end-to-end reproduction with verified ground-truth results and detailed scoring rubrics. Using an agentified assessment pipeline, we evaluate a set of coding agents on PRBench and analyze their capabilities across key dimensions of scientific reasoning and execution. The best-performing agent, OpenAI Codex powered by GPT-5.3-Codex, achieves a mean overall score of 34%. All agents exhibit a zero end-to-end reproduction success rate, with particularly poor performance in data accuracy and code correctness. We further identify systematic failure modes, including errors in formula implementation, inability to debug numerical simulations, and fabrication of output data. Overall, PRBench provides a rigorous benchmark for evaluating progress toward autonomous scientific research.
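The evaluation protocol the abstract describes (task instruction plus paper content in, sandboxed execution, rubric-based scoring) could be sketched roughly as below. This is a minimal illustration only; every name here (`Task`, `RubricItem`, `score_run`) is hypothetical and not the paper's actual API or scoring formula:

```python
from dataclasses import dataclass

@dataclass
class RubricItem:
    """One expert-written scoring criterion for a task (hypothetical structure)."""
    description: str   # e.g. "reproduced the published curve within tolerance"
    max_points: float

@dataclass
class Task:
    """What the agent receives: instruction, paper text, and a hidden rubric."""
    instruction: str
    paper_text: str
    rubric: list  # list[RubricItem], used by the assessor, not shown to the agent

def score_run(rubric, points_awarded):
    """Aggregate per-criterion points into a percentage score for one task."""
    total = sum(item.max_points for item in rubric)
    earned = sum(points_awarded)
    return 100.0 * earned / total

# Hypothetical usage: two criteria, the agent earns partial credit.
rubric = [RubricItem("correct formula implementation", 6.0),
          RubricItem("quantitative results match paper", 4.0)]
print(score_run(rubric, [6.0, 0.0]))  # 60.0
```

Averaging such per-task percentages over all 30 tasks would yield a mean overall score of the kind reported (34% for the best agent), while the separate end-to-end success metric is all-or-nothing per task.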
Problem

Research questions and friction points this paper is trying to address.

paper reproduction, AI agents, scientific benchmark, physics research, end-to-end evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

paper reproduction, AI agent, scientific benchmark, physics research, code generation
👥 Authors

Shi Qiu, Peking University (Multimodality, LLM Evaluation, NLP)
Junyi Deng, School of Physics, Peking University, China
Yiwei Deng, School of Physics, Peking University, China
Haoran Dong, School of Physics, Peking University, China
Jieyu Fu, School of Physics, Peking University, China
Mao Li, Professor of Chemistry, Jilin University, China (sequence-controlled polymerization, topology-controlled polymerization, electropolymerization)
Zeyu Li, School of Physics, Peking University, China
Zhaolong Zhang, School of Physics, Peking University, China
Huiwen Zheng, School of Physics, Peking University, China
Leidong Bao, School of Physics, Peking University, China
Anqi Lv, School of Physics, Peking University, China
Zihan Mo, School of Physics, Peking University, China
Yadi Niu, School of Physics, Peking University, China
Yiyang Peng, Imperial College London (Wireless Communications)
Yu Tian, School of Physics, Peking University, China
Yili Wang, Jilin University (Graph Neural Networks)
Ziyu Wang, School of Physics, Peking University, China
Zi-Yu Wang, School of Physics, Peking University, China
Jiashen Wei, School of Physics, Peking University, China
Liuheng Wu, School of Physics, Peking University, China
Aoran Xue, School of Physics, Peking University, China
Leyi Yang, School of Physics, Peking University, China
Guanglu Yuan, School of Physics, Peking University, China
Xiarui Zhan, School of Physics, Peking University, China
Jingjun Zhang, School of Physics, Peking University, China