LongReasonArena: A Long Reasoning Benchmark for Large Language Models

📅 2025-08-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing long-context benchmarks primarily evaluate input comprehension, neglecting systematic assessment of extended reasoning chains involving operations such as retrieval and backtracking. This work introduces LongReasonArena, a benchmark specifically designed to evaluate the long-reasoning capabilities of large language models (LLMs). It employs multi-step algorithmic tasks to emulate realistic reasoning processes, with required reasoning chains spanning thousands to a million tokens. A scalable reasoning-length control mechanism enables precise adjustment of reasoning depth by varying the inputs. Empirically, the authors find that accuracy declines linearly with the logarithm of the expected number of reasoning steps. Extensive experiments reveal severe performance limitations across state-of-the-art open- and closed-source models (e.g., DeepSeek-R1 achieves only 7.5% accuracy), confirming LongReasonArena's difficulty and its diagnostic utility for probing fundamental reasoning bottlenecks in LLMs.
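The reported log-linear trend can be illustrated with a small least-squares fit. The data points below are invented for illustration only and are not the paper's measurements:

```python
import math

# Hypothetical (expected reasoning steps, accuracy) pairs following the
# log-linear decline reported in the paper; values are illustrative,
# not taken from the paper's results.
data = [(10, 0.90), (100, 0.70), (1_000, 0.50), (10_000, 0.30)]

def fit_log_linear(points):
    """Least-squares fit of: accuracy = a - b * log10(steps)."""
    xs = [math.log10(steps) for steps, _ in points]
    ys = [acc for _, acc in points]
    n = len(points)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope of the ordinary least-squares line, negated to match the
    # "a - b * log10(steps)" parameterization.
    b = -sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my + b * mx
    return a, b

a, b = fit_log_linear(data)
# Under this model, accuracy drops by b for every 10x increase in steps.
print(f"accuracy ≈ {a:.2f} - {b:.2f} * log10(steps)")
```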

📝 Abstract
Existing long-context benchmarks for Large Language Models (LLMs) focus on evaluating comprehension of long inputs, while overlooking the evaluation of long reasoning abilities. To address this gap, we introduce LongReasonArena, a benchmark specifically designed to assess the long reasoning capabilities of LLMs. Our tasks require models to solve problems by executing multi-step algorithms that reflect key aspects of long reasoning, such as retrieval and backtracking. By controlling the inputs, the required reasoning length can be arbitrarily scaled, reaching up to 1 million tokens of reasoning for the most challenging tasks. Extensive evaluation results demonstrate that LongReasonArena presents a significant challenge for both open-source and proprietary LLMs. For instance, DeepSeek-R1 achieves only 7.5% accuracy on our task. Further analysis also reveals that the accuracy exhibits a linear decline with respect to the logarithm of the expected number of reasoning steps. Our code and data are available at https://github.com/LongReasonArena/LongReasonArena.
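The abstract's idea of scaling the required reasoning length by controlling the inputs can be sketched with a hypothetical pointer-chasing task. This is not one of the paper's actual tasks, just a minimal illustration in which each hop is one retrieval step, so the required reasoning length grows directly with a parameter we choose:

```python
import random

def make_traversal_task(n_steps, seed=0):
    """Build a toy pointer-chasing task: the solver must follow `n_steps`
    hops through a random chain, so the required number of reasoning steps
    is controlled exactly by the input size.

    Returns (next_of mapping, start node, correct answer)."""
    rng = random.Random(seed)
    nodes = list(range(n_steps + 1))
    rng.shuffle(nodes)
    # next_of maps each node to its successor along the chain.
    next_of = {nodes[i]: nodes[i + 1] for i in range(n_steps)}
    return next_of, nodes[0], nodes[-1]

def solve(next_of, start, hops):
    """Reference solution: each hop corresponds to one retrieval step."""
    node = start
    for _ in range(hops):
        node = next_of[node]
    return node

# Doubling n_steps doubles the reasoning chain a model must produce.
next_of, start, answer = make_traversal_task(1_000)
assert solve(next_of, start, 1_000) == answer
```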
Problem

Research questions and friction points this paper aims to address.

Evaluating long reasoning abilities in large language models
Assessing multi-step algorithmic problem solving capabilities
Scaling reasoning length up to 1 million tokens
Innovation

Methods, ideas, or system contributions that make the work stand out.

LongReasonArena benchmark for long reasoning evaluation
Multi-step algorithms requiring retrieval and backtracking
Scalable reasoning length up to 1M tokens