ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering

📅 2025-06-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks inadequately assess AI systems' capacity for sustained, goal-directed iterative optimization, which is essential for computationally hard, long-horizon planning tasks (e.g., package routing, crew scheduling); they evaluate neither progressive improvement nor long-term stability. Method: The paper introduces a benchmark centered on score-oriented, long-horizon iterative optimization, grounded in real problems from the AtCoder Heuristic Contests. It integrates an interactive agent framework, test-run feedback, and visualization tools for fine-grained analysis. Contribution/Results: Unlike conventional pass/fail, short-horizon coding benchmarks, this paradigm systematically evaluates cross-problem consistency, long-term robustness, and incremental solution refinement. Empirical results show that state-of-the-art large language models achieve high performance on individual problems but fall well short of human experts in cross-problem consistency and long-horizon robustness, underscoring the benchmark's value for driving future progress.

📝 Abstract
How well do AI systems perform in algorithm engineering for hard optimization problems in domains such as package-delivery routing, crew scheduling, factory production planning, and power-grid balancing? We introduce ALE-Bench, a new benchmark for evaluating AI systems on score-based algorithmic programming contests. Drawing on real tasks from the AtCoder Heuristic Contests, ALE-Bench presents optimization problems that are computationally hard and admit no known exact solution. Unlike short-duration, pass/fail coding benchmarks, ALE-Bench encourages iterative solution refinement over long time horizons. Our software framework supports interactive agent architectures that leverage test-run feedback and visualizations. Our evaluation of frontier LLMs revealed that while they demonstrate high performance on specific problems, a notable gap remains compared to humans in terms of consistency across problems and long-horizon problem-solving capabilities. This highlights the need for this benchmark to foster future AI advancements.
Problem

Research questions and friction points this paper is trying to address.

Evaluating AI performance in long-horizon algorithm engineering tasks
Benchmarking AI on hard optimization problems without exact solutions
Assessing iterative solution refinement capabilities of AI systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

ALE-Bench evaluates AI on hard optimization problems
Supports iterative refinement with test-run feedback (see the sketch after this list)
Uses interactive agent architectures with visualizations
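
To make the evaluation loop concrete, below is a minimal, hypothetical sketch of the kind of score-driven refinement loop an agent runs in such a contest-style setting. This is not the ALE-Bench API: run_tests, propose_improvement, and TIME_BUDGET_SECONDS are placeholder names standing in for the framework's test-run feedback and an agent's revision step.

```python
import time

TIME_BUDGET_SECONDS = 4 * 60 * 60  # hypothetical 4-hour, contest-style horizon


def run_tests(solution_code: str) -> float:
    """Placeholder: compile and run the solution on local test cases and
    return a contest-style score (higher is better)."""
    raise NotImplementedError


def propose_improvement(solution_code: str, score: float, feedback: str) -> str:
    """Placeholder: ask an LLM agent for a revised solution, given the
    current code plus test-run feedback (scores, logs, visualizations)."""
    raise NotImplementedError


def refinement_loop(initial_solution: str) -> str:
    """Keep refining a solution until the time budget runs out, retaining
    only candidates that strictly improve the local score."""
    best_code = initial_solution
    best_score = run_tests(best_code)
    deadline = time.time() + TIME_BUDGET_SECONDS
    while time.time() < deadline:
        candidate = propose_improvement(best_code, best_score,
                                        feedback="local test results")
        candidate_score = run_tests(candidate)
        # Accept only verified improvements: long-horizon gains come from
        # many small, validated steps rather than one-shot generation.
        if candidate_score > best_score:
            best_code, best_score = candidate, candidate_score
    return best_code
```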
Authors
Yuki Imajuku (Sakana AI)
Kohki Horie (The University of Tokyo, Japan)
Yoichi Iwata (AtCoder, Japan)
Kensho Aoki (AtCoder, Japan)
Naohiro Takahashi (AtCoder, Japan)
Takuya Akiba (Sakana AI)
Deep Learning · Machine Learning