FrontierOR: Benchmarking LLMs' Capacity for Efficient Algorithm Design in Large-Scale Optimization

📅 2026-05-24
🤖 AI Summary
Existing benchmarks struggle to evaluate large language models' (LLMs') ability to design efficient algorithms for real-world, large-scale optimization problems, because they are often confined to small-scale or simplified settings. This work proposes FrontierOR, among the first benchmarks for large-scale optimization algorithm design, built from papers published in top-tier operations research venues. It comprises 180 methodologically diverse, realistically scaled tasks, each with standardized instances and a hidden, expert-verified evaluation suite. The authors assess the algorithm-generation capabilities of seven LLMs spanning frontier, cost-effective, and open-source models under both one-shot generation and test-time evolution settings. Results show that even the strongest one-shot model outperforms Gurobi in both solution quality and computational efficiency in only 31% of cases, and strong coding agents with test-time evolution reach only 50% on selected hard tasks, highlighting both the difficulty of scalable optimization algorithm design for LLMs and the room left for improvement.
📝 Abstract
Large language models (LLMs) are increasingly used for optimization modeling and solver-code generation, yet practical operations research and optimization problems often require a harder capability: designing scalable algorithms that exploit problem structure and outperform direct formulation-and-solve baselines. Existing benchmarks are limited to small or simplified examples far below real-world scale and complexity. We introduce FrontierOR, among the first benchmarks to systematically evaluate LLM-based efficient algorithm design for realistic large-scale optimization problems. FrontierOR includes 180 tasks derived from methodologically diverse papers published in top-tier operations research venues, each with standardized instances and a hidden, expert-verified evaluation suite. We evaluate seven LLMs spanning frontier, cost-effective, and open-source models both in one-shot and test-time evolution settings. The results reveal that frontier models still struggle to move from executable formulations to efficient optimization algorithms: the strongest one-shot model outperforms Gurobi in only 31% of cases in both solution quality and computational efficiency, and even strong coding agents with test-time evolution achieve only 50% on selected hard tasks. FrontierOR establishes a practical evaluation platform for LLM-based optimization algorithm design, which enables future LLMs and agents to be systematically tested on whether they can move beyond correct formulation toward a feasible, high-quality, and efficient algorithm. Our FrontierOR Benchmark is available at https://anonymous.4open.science/r/efficient-opt-bench-F03D.
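The abstract's headline metric counts a case as a win only when the generated algorithm beats the solver baseline on both solution quality and computational efficiency. The Python sketch below illustrates one way such a joint criterion could be scored. It is not the FrontierOR evaluation code: the `RunResult` fields, the minimization convention, the relative tolerance, and the rule for infeasible runs are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class RunResult:
    """Outcome of one algorithm on one instance (hypothetical schema)."""
    feasible: bool
    objective: float   # assumed: lower is better (minimization)
    runtime_s: float   # wall-clock seconds


def beats_baseline(candidate: RunResult, baseline: RunResult,
                   rel_tol: float = 1e-6) -> bool:
    """True only if the candidate is at least as good AND strictly faster.

    Assumption: an infeasible candidate never wins, and a feasible
    candidate wins outright when the baseline found nothing feasible.
    """
    if not candidate.feasible:
        return False
    if not baseline.feasible:
        return True
    slack = rel_tol * max(1.0, abs(baseline.objective))
    quality_ok = candidate.objective <= baseline.objective + slack
    faster = candidate.runtime_s < baseline.runtime_s
    return quality_ok and faster


def win_rate(pairs: Iterable[Tuple[RunResult, RunResult]]) -> float:
    """Fraction of (candidate, baseline) pairs where the candidate wins."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(beats_baseline(c, b) for c, b in pairs) / len(pairs)


if __name__ == "__main__":
    baseline = RunResult(feasible=True, objective=10.0, runtime_s=20.0)
    demo = [
        (RunResult(True, 10.0, 5.0), baseline),   # same quality, faster: win
        (RunResult(True, 9.0, 30.0), baseline),   # better quality, slower: no win
        (RunResult(False, 0.0, 1.0), baseline),   # infeasible: no win
        (RunResult(True, 12.0, 1.0), baseline),   # faster but worse: no win
    ]
    print(f"win rate: {win_rate(demo):.2f}")
```

Requiring both conditions at once is what makes the criterion strict: a heuristic that returns a worse solution quickly, or a better solution slowly, does not count as outperforming the baseline.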
Problem

Research questions and friction points this paper is trying to address.

large-scale optimization
algorithm design
LLM benchmarking
operations research
efficient algorithms
Innovation

Methods, ideas, or system contributions that make the work stand out.

large-scale optimization
algorithm design
LLM benchmarking
efficient solvers
operations research
🔎 Similar Papers
Minwei Kong
Singapore-MIT Alliance for Research and Technology
Chonghe Jiang
Singapore-MIT Alliance for Research and Technology, Massachusetts Institute of Technology
Ao Qu
Massachusetts Institute of Technology
Language Agent, Multisensory AI, Computational Social Science
Wenbin Ouyang
Massachusetts Institute of Technology
Zhaoming Zeng
Northeastern University
Xiaotong Guo
Ph.D. in Transportation, MIT
Transportation Modeling, Optimization, Shared Mobility, Public Transit
Zhekai Li
MSc at The Chinese University of Hong Kong, Shenzhen
speech synthesis, singing voice synthesis
Junyi Li
Singapore-MIT Alliance for Research and Technology
Yi Fan
Shanghai Jiao Tong University
Xinshou Zheng
Boston University
Xi Jing
Boston University
Yikai Zhang
Fudan University
Natural Language Processing, Autonomous Agent
Zhiwei Liang
Emory University
Seonghoo Kim
Northwestern University
Runqing Yang
Boston University
Zijian Zhou
MiniMax
Sirui Li
Microsoft
Han Zheng
Massachusetts Institute of Technology
Wangyang Ying
Arizona State University
Data Mining, Data-Centric AI, Natural Language Processing, Speech Recognition
Ou Zheng
Zhiling Research
Computational Biology, Data mining, Artificial Intelligence, Computer Vision
Chonghuan Wang
University of Texas at Dallas
Jinglong Zhao
Boston University
Operations Research, Econometrics, Online Platforms
Hanzhang Qin
Assistant Professor, NUS
Operations Research, Dynamic Programming, Statistical Learning, Supply Chain Management
Cathy Wu
MIT
Machine learning, Control, Optimization, Multi-agent systems, Intelligent Transportation Systems
Paul Pu Liang
Singapore-MIT Alliance for Research and Technology, Massachusetts Institute of Technology