100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability?

📅 2025-05-25
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing long-context evaluation benchmarks (e.g., LongBench) suffer from two key limitations: (1) their metrics conflate pre-existing knowledge with genuine long-range modeling capability, compromising cross-model comparability; and (2) they employ fixed input lengths, limiting generalizability and hindering precise identification of context-length failure thresholds. Method: The paper proposes the first length-adaptive, task-agnostic evaluation paradigm, introducing a normalized performance decomposition metric to disentangle intrinsic knowledge from contextual reasoning ability and constructing a controllable, realistic document QA and summarization benchmark. The approach employs dynamic length sampling and multi-model validation. Contribution/Results: The framework systematically uncovers each model's context-length bottleneck for the first time, significantly enhancing both comparability and interpretability in long-context capability assessment. It enables fine-grained, length-resolved analysis while maintaining task relevance and real-world applicability.
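To make the decomposition idea above concrete, here is a minimal Python sketch of one way such a normalized metric could be computed from a closed-book (no-document) score and a long-context score; the function name, formula, and example values are illustrative assumptions, not the paper's exact definition.

```python
# Illustrative sketch only: one plausible normalization that credits a model
# for what the long context adds beyond its pre-existing (closed-book) knowledge.
# This is NOT the paper's exact metric; the formula is an assumption.

def normalized_long_context_gain(closed_book_acc: float, long_context_acc: float) -> float:
    """Fraction of the closed-book headroom recovered when the document is provided.

    closed_book_acc  -- accuracy when the model answers without the document
    long_context_acc -- accuracy when the full long input is provided
    Returns ~1.0 if the context closes the whole gap, 0.0 if it adds nothing,
    and a negative value if the long input actually hurts the model.
    """
    headroom = 1.0 - closed_book_acc
    if headroom <= 0.0:
        return 0.0  # everything is already answerable from parametric knowledge
    return (long_context_acc - closed_book_acc) / headroom


if __name__ == "__main__":
    # Example: 40% closed-book, 70% with the document -> half the headroom recovered.
    print(normalized_long_context_gain(0.4, 0.7))  # 0.5
```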

πŸ“ Abstract
Long-context capability is considered one of the most important abilities of LLMs, as a truly long-context-capable LLM lets users effortlessly handle tasks that would otherwise be exhausting -- e.g., digesting a long-form document to find answers vs. simply asking the LLM about it. However, existing real-task-based long-context evaluation benchmarks have two major shortcomings. First, benchmarks like LongBench often do not provide proper metrics to separate long-context performance from the model's baseline ability, making cross-model comparison unclear. Second, such benchmarks are usually constructed with fixed input lengths, which limits their applicability across different models and fails to reveal when a model begins to break down. To address these issues, we introduce a length-controllable long-context benchmark and a novel metric that disentangles baseline knowledge from true long-context capabilities. Experiments demonstrate the superiority of our approach in effectively evaluating LLMs.
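To illustrate what "length-controllable" could look like in practice, the sketch below assembles a QA prompt at an arbitrary target token budget by padding the gold document with distractor documents; the function name, tokenizer interface, and padding strategy are assumptions for illustration, not the benchmark's actual construction pipeline.

```python
# Hedged sketch, not the paper's pipeline: build the same QA item at different
# context lengths so a model can be probed for the length at which it breaks down.
import random


def build_length_controlled_prompt(question: str, gold_doc: str,
                                   distractors: list[str],
                                   target_tokens: int, tokenizer) -> str:
    """Pad the gold document with shuffled distractor documents until the prompt
    is roughly `target_tokens` long (e.g., 8k, 32k, 128k), assuming `tokenizer`
    exposes an `encode` method returning token ids."""
    docs, pool = [gold_doc], list(distractors)
    random.shuffle(pool)

    def prompt_len(parts: list[str]) -> int:
        return len(tokenizer.encode("\n\n".join(parts + [question])))

    while pool and prompt_len(docs) < target_tokens:
        docs.append(pool.pop())

    random.shuffle(docs)  # vary where the evidence sits within the long context
    return "\n\n".join(docs) + "\n\nQuestion: " + question
```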
Problem

Research questions and friction points this paper is trying to address.

Evaluating true long-context ability in LLMs
Separating long-context performance from a model's baseline ability
Assessing model breakdown points across varying input lengths
Innovation

Methods, ideas, or system contributions that make the work stand out.

Length-controllable benchmark for diverse model testing
Novel metric separates baseline from long-context ability
Superior evaluation of LLMs' true long-context performance