100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability?

📅 2025-05-25
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing long-context evaluation benchmarks (e.g., LongBench) suffer from two key limitations: (1) their metrics conflate pre-existing knowledge with genuine long-range modeling capability, compromising cross-model comparability; and (2) they employ fixed input lengths, limiting generalizability and hindering precise identification of context-length failure thresholds. Method: The paper proposes the first length-adaptive, task-agnostic evaluation paradigm, introducing a normalized performance decomposition metric to disentangle intrinsic knowledge from contextual reasoning ability and constructing a controllable, realistic document QA and summarization benchmark. The approach employs dynamic length sampling and multi-model validation. Contribution/Results: The framework systematically uncovers each model's context-length bottleneck for the first time, significantly enhancing both comparability and interpretability in long-context capability assessment. It enables fine-grained, length-resolved analysis while maintaining task relevance and real-world applicability.
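To make the decomposition idea above concrete, here is a minimal Python sketch of one way such a normalized metric could be computed from a closed-book (no-document) score and a long-context score; the function name, formula, and example values are illustrative assumptions, not the paper's exact definition.

```python
# Illustrative sketch only: one plausible normalization that credits a model
# for what the long context adds beyond its pre-existing (closed-book) knowledge.
# This is NOT the paper's exact metric; the formula is an assumption.

def normalized_long_context_gain(closed_book_acc: float, long_context_acc: float) -> float:
    """Fraction of the closed-book headroom recovered when the document is provided.

    closed_book_acc  -- accuracy when the model answers without the document
    long_context_acc -- accuracy when the full long input is provided
    Returns ~1.0 if the context closes the whole gap, 0.0 if it adds nothing,
    and a negative value if the long input actually hurts the model.
    """
    headroom = 1.0 - closed_book_acc
    if headroom <= 0.0:
        return 0.0  # everything is already answerable from parametric knowledge
    return (long_context_acc - closed_book_acc) / headroom


if __name__ == "__main__":
    # Example: 40% closed-book, 70% with the document -> half the headroom recovered.
    print(normalized_long_context_gain(0.4, 0.7))  # 0.5
```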

πŸ“ Abstract
Long-context capability is considered one of the most important abilities of LLMs, as a truly long-context-capable LLM lets users effortlessly handle tasks that would otherwise be exhausting -- e.g., digesting a long-form document to find answers vs. simply asking the LLM about it. However, existing real-task-based long-context evaluation benchmarks have two major shortcomings. First, benchmarks like LongBench often do not provide proper metrics to separate long-context performance from the model's baseline ability, making cross-model comparison unclear. Second, such benchmarks are usually constructed with fixed input lengths, which limits their applicability across different models and fails to reveal when a model begins to break down. To address these issues, we introduce a length-controllable long-context benchmark and a novel metric that disentangles baseline knowledge from true long-context capabilities. Experiments demonstrate the superiority of our approach in effectively evaluating LLMs.
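To illustrate what "length-controllable" could look like in practice, the sketch below assembles a QA prompt at an arbitrary target token budget by padding the gold document with distractor documents; the function name, tokenizer interface, and padding strategy are assumptions for illustration, not the benchmark's actual construction pipeline.

```python
# Hedged sketch, not the paper's pipeline: build the same QA item at different
# context lengths so a model can be probed for the length at which it breaks down.
import random


def build_length_controlled_prompt(question: str, gold_doc: str,
                                   distractors: list[str],
                                   target_tokens: int, tokenizer) -> str:
    """Pad the gold document with shuffled distractor documents until the prompt
    is roughly `target_tokens` long (e.g., 8k, 32k, 128k), assuming `tokenizer`
    exposes an `encode` method returning token ids."""
    docs, pool = [gold_doc], list(distractors)
    random.shuffle(pool)

    def prompt_len(parts: list[str]) -> int:
        return len(tokenizer.encode("\n\n".join(parts + [question])))

    while pool and prompt_len(docs) < target_tokens:
        docs.append(pool.pop())

    random.shuffle(docs)  # vary where the evidence sits within the long context
    return "\n\n".join(docs) + "\n\nQuestion: " + question
```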
Problem

Research questions and friction points this paper is trying to address.

Evaluating true long-context ability in LLMs
Separating long-context performance from a model's baseline ability
Assessing model breakdown points across varying input lengths
Innovation

Methods, ideas, or system contributions that make the work stand out.

Length-controllable benchmark for diverse model testing
Novel metric separates baseline from long-context ability
Superior evaluation of LLMs' true long-context performance