🤖 AI Summary
Existing benchmarks inadequately assess LLM agents' capabilities for scientific software development within real-world research ecosystems: neither conceptual-reasoning benchmarks nor general programming benchmarks cover the end-to-end, collaborative evolution of production-grade scientific code. To address this gap, we propose AInsteinBench, the first LLM-agent evaluation benchmark grounded in authentic research software ecosystems. It targets six domains, including quantum chemistry and molecular dynamics, using maintenance-level pull requests from widely adopted open-source libraries. Tasks execute in reproducible sandbox environments and integrate test-driven validation, scientific-semantic failure analysis, unit-test coverage measurement, and difficulty calibration. By anchoring evaluation in production codebases and applying expert review with multi-stage filtering, AInsteinBench defines and quantifies the core competencies of scientific computing agents: domain-knowledge integration, numerical robustness, and collaborative code evolution. Our systematic evaluation exposes critical weaknesses of leading code-generation agents across all of these dimensions.
📝 Abstract
We introduce AInsteinBench, a large-scale benchmark for evaluating whether large language model (LLM) agents can operate as scientific computing development agents within real research software ecosystems. Unlike existing scientific reasoning benchmarks, which focus on conceptual knowledge, or software engineering benchmarks, which emphasize generic feature implementation and issue resolution, AInsteinBench evaluates models in end-to-end scientific development settings grounded in production-grade scientific repositories. The benchmark consists of tasks derived from maintainer-authored pull requests across six widely used scientific codebases, spanning quantum chemistry, quantum computing, molecular dynamics, numerical relativity, fluid dynamics, and cheminformatics. All tasks are curated through multi-stage filtering and expert review to ensure scientific challenge, adequate test coverage, and well-calibrated difficulty. By combining execution in reproducible environments, analysis of scientifically meaningful failure modes, and test-driven verification, AInsteinBench measures a model's ability to move beyond surface-level code generation toward the core competencies required for computational scientific research.