AlgBench: To What Extent Do Large Reasoning Models Understand Algorithms?

📅 2026-01-08
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks struggle to effectively evaluate the algorithmic reasoning capabilities of large reasoning models. To address this gap, this work proposes the first algorithm-centric evaluation framework, featuring a fine-grained benchmark of over 3,000 original problems spanning 27 algorithm categories, meticulously curated by ACM experts. The study introduces a multi-dimensional assessment methodology and reveals that while leading models achieve up to 92% accuracy on non-optimization tasks, their performance drops sharply to roughly 49% on global optimization algorithms, such as dynamic programming, exposing fundamental limitations in comprehending complex algorithmic structures. Furthermore, the work identifies, for the first time, a phenomenon the authors term the "strategic over-shift," shedding new light on model failure modes in algorithmic reasoning.

📝 Abstract
Reasoning ability has become a central focus in the advancement of Large Reasoning Models (LRMs). Although notable progress has been achieved on several reasoning benchmarks such as MATH500 and LiveCodeBench, existing benchmarks for algorithmic reasoning remain limited, failing to answer a critical question: Do LRMs truly master algorithmic reasoning? To answer this question, we propose AlgBench, an expert-curated benchmark that evaluates LRMs under an algorithm-centric paradigm. AlgBench consists of over 3,000 original problems spanning 27 algorithms, constructed by ACM algorithmic experts and organized under a comprehensive taxonomy, including Euclidean-structured, non-Euclidean-structured, non-optimized, local-optimized, global-optimized, and heuristic-optimized categories. Empirical evaluations on leading LRMs (e.g., Gemini-3-Pro, DeepSeek-v3.2-Speciale, and GPT-o3) reveal substantial performance heterogeneity: while models perform well on non-optimized tasks (up to 92%), accuracy drops sharply to around 49% on globally optimized algorithms such as dynamic programming. Further analysis uncovers "strategic over-shifts," wherein models prematurely abandon correct algorithmic designs due to necessary low-entropy tokens. These findings expose fundamental limitations of problem-centric reinforcement learning and highlight the necessity of an algorithm-centric training paradigm for robust algorithmic reasoning.
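The per-category heterogeneity the abstract reports (92% on non-optimized tasks vs. ~49% on globally optimized ones) implies an evaluation step that buckets each problem by its taxonomy category and computes accuracy per bucket. Below is a minimal sketch of that aggregation. The category names are taken from the abstract, but the record format, the toy data, and the `accuracy_by_category` helper are assumptions for illustration; AlgBench's actual data schema and scoring pipeline are not described here.

```python
from collections import defaultdict

# Hypothetical per-problem results: (taxonomy category, whether the model solved it).
# The toy numbers below are illustrative only, not AlgBench data.
results = [
    ("non-optimized", True), ("non-optimized", True), ("non-optimized", True),
    ("non-optimized", False),
    ("global-optimized", True), ("global-optimized", False),
    ("global-optimized", False), ("global-optimized", False),
]

def accuracy_by_category(records):
    """Tally solved/total counts per taxonomy category and return per-category accuracy."""
    tally = defaultdict(lambda: [0, 0])  # category -> [solved, total]
    for category, solved in records:
        tally[category][0] += int(solved)
        tally[category][1] += 1
    return {cat: solved / total for cat, (solved, total) in tally.items()}

print(accuracy_by_category(results))
# → {'non-optimized': 0.75, 'global-optimized': 0.25} on this toy data
```

Bucketing by algorithm category rather than by problem is the essence of the paper's algorithm-centric paradigm: aggregate scores are reported per algorithm family, which is what exposes the gap between non-optimized and globally optimized tasks.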
Problem

Research questions and friction points this paper is trying to address.

algorithmic reasoning
Large Reasoning Models
reasoning benchmarks
algorithm understanding
model evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

algorithmic reasoning
large reasoning models
benchmark design
algorithm-centric paradigm
strategic over-shift