🤖 AI Summary
Existing benchmarks inadequately assess AI systems' capacity for sustained, goal-directed iterative optimization, which is critical for NP-hard, long-horizon planning tasks such as package routing and crew scheduling; in particular, they do not evaluate progressive improvement or long-term stability. Method: We introduce the first benchmark centered on score-oriented, long-horizon iterative optimization, grounded in real AtCoder programming-contest problems. It integrates an interactive agent framework, real-time feedback mechanisms, and visualization tools for fine-grained analysis. Contribution/Results: Unlike conventional pass/fail, short-horizon coding benchmarks, this paradigm systematically evaluates cross-problem consistency, long-term robustness, and incremental solution refinement. Empirical results show that state-of-the-art large language models achieve high single-task performance but significantly underperform human experts in cross-task generalization and long-horizon robustness, demonstrating the benchmark's validity and rigor.
📝 Abstract
How well do AI systems perform in algorithm engineering for hard optimization problems in domains such as package-delivery routing, crew scheduling, factory production planning, and power-grid balancing? We introduce ALE-Bench, a new benchmark for evaluating AI systems on score-based algorithmic programming contests. Drawing on real tasks from the AtCoder Heuristic Contests, ALE-Bench presents optimization problems that are computationally hard and admit no known exact solution. Unlike short-duration, pass/fail coding benchmarks, ALE-Bench encourages iterative solution refinement over long time horizons. Our software framework supports interactive agent architectures that leverage test-run feedback and visualizations. Our evaluation of frontier LLMs revealed that while they demonstrate high performance on specific problems, a notable gap remains compared to humans in consistency across problems and in long-horizon problem-solving. This underscores the benchmark's value in fostering future AI advances.
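The score-based, iterative-refinement paradigm described above can be sketched roughly as follows. This is a minimal illustration, not ALE-Bench's actual API: the `score`, `propose_refinement`, and `iterative_optimize` functions are hypothetical stand-ins for the judge's scorer and an agent's edit loop.

```python
import random

def score(solution):
    # Hypothetical scoring function: higher is better. In an AtCoder-style
    # heuristic contest, the judge computes a score from test runs instead.
    return -sum((x - 0.5) ** 2 for x in solution)

def propose_refinement(solution, rng):
    # An agent would edit its program based on feedback; here we mimic that
    # with a small random perturbation of one component of the solution.
    candidate = list(solution)
    i = rng.randrange(len(candidate))
    candidate[i] += rng.uniform(-0.1, 0.1)
    return candidate

def iterative_optimize(initial, steps=1000, seed=0):
    """Greedy hill-climbing: keep a refinement only if it improves the score."""
    rng = random.Random(seed)
    best, best_score = initial, score(initial)
    for _ in range(steps):
        candidate = propose_refinement(best, rng)
        s = score(candidate)
        if s > best_score:  # test-run feedback guides the next refinement
            best, best_score = candidate, s
    return best, best_score

sol, val = iterative_optimize([0.0] * 5)
```

The point of the benchmark is that sustaining this loop over a long horizon, across many distinct problems, is where current LLMs fall short of human experts, even when a single greedy pass like the one above already yields local improvement.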