🤖 AI Summary
Existing benchmarks inadequately assess LLM agents' capabilities for scientific software development within real-world research ecosystems: neither conceptual-reasoning benchmarks nor general programming benchmarks cover the end-to-end, collaborative evolution of production-grade scientific code. To address this gap, we propose AInsteinBench, the first LLM-agent evaluation benchmark grounded in authentic research software ecosystems. It targets six domains, including quantum chemistry and molecular dynamics, using maintenance-level pull requests from widely adopted open-source libraries. Tasks execute in reproducible sandbox environments and integrate test-driven validation, scientific-semantic failure analysis, unit-test coverage measurement, and difficulty calibration. By anchoring evaluation in production codebases and applying expert review with multi-stage filtering, AInsteinBench defines and quantifies the core competencies of scientific computing agents: domain-knowledge integration, numerical robustness, and collaborative code evolution. Our systematic evaluation exposes critical weaknesses of leading code-generation agents across all of these dimensions.
📝 Abstract
We introduce AInsteinBench, a large-scale benchmark for evaluating whether large language model (LLM) agents can operate as scientific computing development agents within real research software ecosystems. Unlike existing scientific reasoning benchmarks, which focus on conceptual knowledge, or software engineering benchmarks, which emphasize generic feature implementation and issue resolution, AInsteinBench evaluates models in end-to-end scientific development settings grounded in production-grade scientific repositories. The benchmark consists of tasks derived from maintainer-authored pull requests across six widely used scientific codebases, spanning quantum chemistry, quantum computing, molecular dynamics, numerical relativity, fluid dynamics, and cheminformatics. All tasks are curated through multi-stage filtering and expert review to ensure scientific challenge, adequate test coverage, and well-calibrated difficulty. By combining execution in reproducible environments, analysis of scientifically meaningful failure modes, and test-driven verification, AInsteinBench measures a model's ability to move beyond surface-level code generation toward the core competencies required for computational scientific research.