🤖 AI Summary
Existing benchmarks for long-context reasoning evaluation are limited, predominantly featuring narrow or simple tasks that inadequately assess large language models' (LLMs) capabilities in complex reasoning. Method: We introduce LongReason, a synthetic, multi-task benchmark designed for complex reasoning over long contexts, comprising 794 multiple-choice questions spanning reading comprehension, logical inference, and mathematical word problems. Short-context reasoning questions are systematically transformed into diverse, length-varying long-context instances through context expansion. Contribution/Results: Our evaluation of 21 mainstream LLMs reveals that most models suffer pronounced performance degradation as context length increases, and that even state-of-the-art LLMs leave substantial room for improvement in robust reasoning across tasks. LongReason will be open-sourced to support standardized evaluation of long-context reasoning capabilities.
📝 Abstract
Large language models (LLMs) have demonstrated remarkable progress in understanding long-context inputs. However, benchmarks for evaluating the long-context reasoning abilities of LLMs have not kept pace. Existing benchmarks often focus on a narrow range of tasks or on tasks that do not demand complex reasoning. To address this gap and enable a more comprehensive evaluation of the long-context reasoning capabilities of current LLMs, we propose a new synthetic benchmark, LongReason, which is constructed by synthesizing long-context reasoning questions from a varied set of short-context reasoning questions through context expansion. LongReason consists of 794 multiple-choice reasoning questions with diverse reasoning patterns across three task categories: reading comprehension, logical inference, and mathematical word problems. We evaluate 21 LLMs on LongReason, revealing that most models experience significant performance drops as context length increases. Our further analysis shows that even state-of-the-art LLMs still have significant room for improvement in providing robust reasoning across different tasks. We will open-source LongReason to support the comprehensive evaluation of LLMs' long-context reasoning capabilities.
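To make the context-expansion idea concrete, the following is a minimal illustrative sketch, not the paper's actual pipeline: it assumes expansion works by interleaving distractor passages around a short question until a target length is reached, while leaving the question itself (and hence its answer) unchanged. The function name `expand_context` and the filler passages are hypothetical.

```python
import random

def expand_context(question: str, filler_passages: list[str],
                   target_words: int, seed: int = 0) -> str:
    """Pad a short reasoning question with distractor passages until the
    combined text reaches roughly target_words words.

    The original question string is kept verbatim as one segment, so the
    expansion is answer-preserving by construction (a simplified stand-in
    for the paper's semantic-preserving expansion).
    """
    rng = random.Random(seed)
    parts = [question]
    # Insert distractors at random positions until the length target is met.
    while sum(len(p.split()) for p in parts) < target_words:
        parts.insert(rng.randrange(len(parts) + 1),
                     rng.choice(filler_passages))
    return "\n\n".join(parts)

# Example: grow a ~13-word question to at least 50 words with unrelated filler.
fillers = [
    "The museum reopened last spring after a lengthy renovation of its east wing.",
    "Rainfall in the coastal region varies widely from one season to the next.",
]
q = "If Alice has 3 apples and buys 4 more, how many does she have?"
long_q = expand_context(q, fillers, target_words=50)
```

Varying `target_words` yields the length-controlled instances of the same underlying question that make per-length performance comparisons possible.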