🤖 AI Summary
Existing benchmarks inadequately assess AI systems' capacity for sustained, goal-directed iterative optimization, which is critical for NP-hard, long-horizon planning tasks such as package routing and crew scheduling; in particular, they do not evaluate progressive improvement or long-term stability. Method: We introduce the first benchmark centered on score-oriented, long-horizon iterative optimization, grounded in real AtCoder programming-contest problems. It integrates an interactive agent framework, real-time feedback mechanisms, and visualization tools for fine-grained analysis. Contribution/Results: Unlike conventional pass/fail, short-horizon coding benchmarks, this paradigm systematically evaluates cross-problem consistency, long-term robustness, and incremental solution refinement. Empirical results show that state-of-the-art large language models achieve high single-task performance but significantly underperform human experts in cross-task generalization and long-horizon robustness, demonstrating the benchmark's validity and rigor.
📝 Abstract
How well do AI systems perform in algorithm engineering for hard optimization problems in domains such as package-delivery routing, crew scheduling, factory production planning, and power-grid balancing? We introduce ALE-Bench, a new benchmark for evaluating AI systems on score-based algorithmic programming contests. Drawing on real tasks from the AtCoder Heuristic Contests, ALE-Bench presents optimization problems that are computationally hard and admit no known exact solution. Unlike short-duration, pass/fail coding benchmarks, ALE-Bench encourages iterative solution refinement over long time horizons. Our software framework supports interactive agent architectures that leverage test-run feedback and visualizations. Our evaluation of frontier LLMs revealed that while they demonstrate high performance on specific problems, a notable gap remains compared to humans in consistency across problems and in long-horizon problem-solving. This underscores the benchmark's value in fostering future AI advances.
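The score-based, iterative-refinement paradigm described above can be sketched roughly as follows. This is a minimal illustration, not ALE-Bench's actual API: the `score`, `propose_refinement`, and `iterative_optimize` functions are hypothetical stand-ins for the judge's scorer and an agent's edit loop.

```python
import random

def score(solution):
    # Hypothetical scoring function: higher is better. In an AtCoder-style
    # heuristic contest, the judge computes a score from test runs instead.
    return -sum((x - 0.5) ** 2 for x in solution)

def propose_refinement(solution, rng):
    # An agent would edit its program based on feedback; here we mimic that
    # with a small random perturbation of one component of the solution.
    candidate = list(solution)
    i = rng.randrange(len(candidate))
    candidate[i] += rng.uniform(-0.1, 0.1)
    return candidate

def iterative_optimize(initial, steps=1000, seed=0):
    """Greedy hill-climbing: keep a refinement only if it improves the score."""
    rng = random.Random(seed)
    best, best_score = initial, score(initial)
    for _ in range(steps):
        candidate = propose_refinement(best, rng)
        s = score(candidate)
        if s > best_score:  # test-run feedback guides the next refinement
            best, best_score = candidate, s
    return best, best_score

sol, val = iterative_optimize([0.0] * 5)
```

The point of the benchmark is that sustaining this loop over a long horizon, across many distinct problems, is where current LLMs fall short of human experts, even when a single greedy pass like the one above already yields local improvement.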