🤖 AI Summary
Existing benchmarks rarely test whether large language models (LLMs) can design efficient algorithms for real-world, large-scale optimization problems; most are confined to small or simplified settings. This work proposes FrontierOR, among the first large-scale optimization benchmarks, built from papers published in top-tier operations research venues. It comprises 180 methodologically diverse, realistically scaled tasks, each with standardized instances and a hidden, expert-verified evaluation suite. The work evaluates the algorithm-generation capabilities of seven LLMs spanning frontier, cost-effective, and open-source models in both one-shot generation and test-time evolution settings. Results show that even the strongest one-shot model outperforms Gurobi in both solution quality and computational efficiency in only 31% of cases, and that strong coding agents with test-time evolution achieve only 50% on selected hard tasks, highlighting both the significant challenges and the untapped potential of LLMs in scalable optimization algorithm design.
📝 Abstract
Large language models (LLMs) are increasingly used for optimization modeling and solver-code generation, yet practical operations research and optimization problems often require a harder capability: designing scalable algorithms that exploit problem structure and outperform direct formulation-and-solve baselines. Existing benchmarks are limited to small or simplified examples far below real-world scale and complexity. We introduce FrontierOR, among the first benchmarks to systematically evaluate LLM-based efficient algorithm design for realistic large-scale optimization problems. FrontierOR includes 180 tasks derived from methodologically diverse papers published in top-tier operations research venues, each with standardized instances and a hidden, expert-verified evaluation suite. We evaluate seven LLMs spanning frontier, cost-effective, and open-source models in both one-shot and test-time evolution settings. The results reveal that frontier models still struggle to move from executable formulations to efficient optimization algorithms: the strongest one-shot model outperforms Gurobi in both solution quality and computational efficiency in only 31% of cases, and even strong coding agents with test-time evolution achieve only 50% on selected hard tasks. FrontierOR establishes a practical evaluation platform for LLM-based optimization algorithm design, enabling future LLMs and agents to be systematically tested on whether they can move beyond correct formulations toward feasible, high-quality, and efficient algorithms. Our FrontierOR Benchmark is available at https://anonymous.4open.science/r/efficient-opt-bench-F03D.