Evaluating Agentic Optimization on Large Codebases

📅 2026-03-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing code benchmarks predominantly rely on synthetic tasks or single-objective evaluations, which inadequately capture the ability of large language models (LLMs) to perform multi-objective optimization within real-world, large-scale codebases. To address this gap, this work introduces FormulaCode—the first fine-grained, multi-objective code optimization benchmark grounded in authentic scientific Python projects. Constructed from 957 performance bottlenecks mined from GitHub, each paired with an expert-authored repair patch and an average of 264.6 community-maintained performance workloads per task, FormulaCode enables systematic evaluation of LLM agents' holistic optimization behavior under repository-level constraints. Experimental results reveal that even state-of-the-art LLMs struggle significantly on such tasks, underscoring the critical need for further research in this direction.

📝 Abstract
Large language model (LLM) coding agents increasingly operate at the repository level, motivating benchmarks that evaluate their ability to optimize entire codebases under realistic constraints. Existing code benchmarks largely rely on synthetic tasks, binary correctness signals, or single-objective evaluation, limiting their ability to assess holistic optimization behavior. We introduce FormulaCode, a benchmark for evaluating agentic optimization on large, real-world codebases with fine-grained, multi-objective performance metrics. FormulaCode comprises 957 performance bottlenecks mined from scientific Python repositories on GitHub, each paired with expert-authored patches and, on average, 264.6 community-maintained performance workloads per task, enabling holistic evaluation of LLM agents' ability to optimize codebases under realistic correctness and performance constraints. Our evaluations reveal that repository-scale, multi-objective optimization remains a major challenge for frontier LLM agents. Project website at: https://formula-code.github.io
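The abstract describes tasks where a candidate patch is judged against both correctness and performance over a set of community workloads. The paper does not publish its harness here, so the following is only a minimal sketch of that kind of multi-objective gate; `evaluate_patch`, `baseline_fn`, and `patched_fn` are hypothetical names, not FormulaCode's API.

```python
# Hypothetical sketch of a correctness-plus-performance gate: a patched
# function must match the baseline's outputs on every workload, and its
# median speedup across workloads is reported as the performance signal.
import time
from statistics import median

def _time_once(fn, args):
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start

def evaluate_patch(baseline_fn, patched_fn, workloads, repeats=5):
    """Return (correct, median_speedup) over a list of argument tuples."""
    speedups = []
    for args in workloads:
        # Correctness gate: outputs must match the reference implementation.
        if patched_fn(*args) != baseline_fn(*args):
            return False, 0.0
        # Performance: best-of-N timing to reduce noise.
        t_base = min(_time_once(baseline_fn, args) for _ in range(repeats))
        t_patch = min(_time_once(patched_fn, args) for _ in range(repeats))
        speedups.append(t_base / t_patch)
    return True, median(speedups)

# Toy example: a naive O(n) sum vs. a closed-form O(1) replacement.
slow = lambda n: sum(range(n))
fast = lambda n: n * (n - 1) // 2
ok, speedup = evaluate_patch(slow, fast, [(10_000,), (100_000,)])
```

A real harness would also need repository-level test suites, statistically robust timing, and per-workload thresholds; this sketch only illustrates why a binary pass/fail signal is weaker than the fine-grained, per-workload speedups the benchmark provides.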
Problem

Research questions and friction points this paper is trying to address.

code optimization
large language models
code benchmarks
multi-objective evaluation
repository-level coding
Innovation

Methods, ideas, or system contributions that make the work stand out.

agentic optimization
multi-objective evaluation
large codebases
performance bottlenecks
LLM coding agents