AInsteinBench: Benchmarking Coding Agents on Scientific Repositories

📅 2025-12-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks inadequately assess LLM agents' scientific-development capabilities in real-world research software ecosystems: neither conceptual-reasoning nor general programming benchmarks cover the end-to-end, collaborative evolution of production-grade scientific code. To address this, we propose AInsteinBench, the first LLM-agent evaluation benchmark grounded in authentic research software ecosystems. It targets six domains, including quantum chemistry and molecular dynamics, using maintenance-level pull requests from widely adopted open-source libraries. Tasks execute in reproducible sandbox environments and integrate test-driven validation, scientifically meaningful failure analysis, unit-test coverage measurement, and difficulty calibration. By anchoring evaluation in production codebases and applying expert review with multi-stage filtering, AInsteinBench defines and quantifies the core competencies of scientific computing agents: domain-knowledge integration, numerical robustness, and collaborative code evolution. Our systematic evaluation exposes critical weaknesses of leading code-generation agents across these dimensions.
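The pipeline sketched above (maintainer pull requests turned into sandboxed, test-verified tasks with calibrated difficulty) implies a per-task record roughly like the following. This is a hypothetical sketch for illustration; the class and field names are assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class ScientificTask:
    """Hypothetical record for one PR-derived benchmark task (names are assumed)."""
    repo: str                   # source repository, e.g. a quantum-chemistry library
    base_commit: str            # commit the agent starts from
    problem_statement: str      # maintainer-authored description of the change
    fail_to_pass: list[str] = field(default_factory=list)  # tests the fix must make pass
    pass_to_pass: list[str] = field(default_factory=list)  # tests that must keep passing
    domain: str = ""            # one of the six scientific domains
    difficulty: str = "medium"  # calibrated difficulty label

# Example instance with placeholder values
task = ScientificTask(
    repo="example/quantum-chem",
    base_commit="abc123",
    problem_statement="Fix SCF convergence for open-shell systems",
    fail_to_pass=["tests/test_scf.py::test_open_shell"],
    domain="quantum chemistry",
)
```

A record like this is enough to reconstruct the sandbox (check out `base_commit`), pose the task (`problem_statement`), and verify a candidate patch against the two test sets.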

📝 Abstract
We introduce AInsteinBench, a large-scale benchmark for evaluating whether large language model (LLM) agents can operate as scientific computing development agents within real research software ecosystems. Unlike existing scientific reasoning benchmarks, which focus on conceptual knowledge, or software engineering benchmarks, which emphasize generic feature implementation and issue resolution, AInsteinBench evaluates models in end-to-end scientific development settings grounded in production-grade scientific repositories. The benchmark consists of tasks derived from maintainer-authored pull requests across six widely used scientific codebases, spanning quantum chemistry, quantum computing, molecular dynamics, numerical relativity, fluid dynamics, and cheminformatics. All tasks are carefully curated through multi-stage filtering and expert review to ensure scientific challenge, adequate test coverage, and well-calibrated difficulty. Through evaluation in executable environments, scientifically meaningful failure modes, and test-driven verification, AInsteinBench measures a model's ability to move beyond surface-level code generation toward the core competencies required for computational scientific research.
Problem

Research questions and friction points this paper is trying to address.

Evaluates LLM agents in real scientific software ecosystems
Measures ability to perform end-to-end scientific development tasks
Assesses competencies beyond surface-level code generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark evaluates LLM agents in scientific repositories
Tasks derived from pull requests in six codebases
Uses executable environments and test-driven verification
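Test-driven verification of the kind listed above usually reduces to a fail-to-pass / pass-to-pass check: a candidate patch counts as resolving the task only if the targeted failing tests now pass and no previously passing test regresses. A minimal sketch, assuming a simple test-ID-to-status result format (the function and format are illustrative, not the benchmark's actual harness):

```python
def is_resolved(fail_to_pass, pass_to_pass, results):
    """Decide whether a candidate patch resolves a task.

    `results` maps test IDs to "PASSED"/"FAILED", collected by rerunning
    the repository's test suite after applying the agent's patch.
    """
    # Every test the task targets (failing before the patch) must now pass...
    fixed = all(results.get(t) == "PASSED" for t in fail_to_pass)
    # ...without regressing tests that already passed before the patch.
    no_regressions = all(results.get(t) == "PASSED" for t in pass_to_pass)
    return fixed and no_regressions

results = {
    "tests/test_scf.py::test_open_shell": "PASSED",
    "tests/test_scf.py::test_closed_shell": "PASSED",
}
print(is_resolved(["tests/test_scf.py::test_open_shell"],
                  ["tests/test_scf.py::test_closed_shell"],
                  results))  # → True
```

Note that `results.get(t)` treats a missing test as a failure, so a patch that deletes or renames a required test is conservatively scored as unresolved.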
Titouan Duston
ByteDance Seed
Shuo Xin
ByteDance Seed
Yang Sun
ByteDance Seed
Daoguang Zan
ByteDance Seed
Large Language Model · Software Engineering · Coding Agent
Aoyan Li
ByteDance Seed
Shulin Xin
ByteDance Seed
Kai Shen
Associate Professor of Computer Science, University of Rochester
Computer Systems
Yixiao Chen
ByteDance Seed
Qiming Sun
California Institute of Technology
Theoretical chemistry and physics
Ge Zhang
ByteDance Seed
Jiashuo Liu
Tsinghua University
Robust Optimization · OOD Generalization · Data-Centric AI
Huan Zhou
Northwestern Polytechnical University
Mobile Edge Computing · Federated Learning · Mobile Social Networks · VANETs · Data Offloading
Jingkai Liu
ByteDance Seed
Zhichen Pu
ByteDance Seed
Yuanheng Wang
University of North Carolina - Wilmington
Applied Linguistics · English for Academic Purposes · Corpus Linguistics · Genre
Bo-Xuan Ge
ByteDance Seed
Xin Tong
ByteDance Seed
Fei Ye
ByteDance Seed
Zhi-Chao Zhao
ByteDance Seed
Wen-Biao Han
ByteDance Seed
Zhoujian Cao
ByteDance Seed
Yueran Zhao
ByteDance Seed
Weiluo Ren
ByteDance Seed
Qingshen Long
ByteDance Seed
Yuxiao Liu
ShanghaiTech University
fMRI · Neuroscience · NLP · Large Language Model