LongBench Pro: A More Realistic and Comprehensive Bilingual Long-Context Evaluation Benchmark

📅 2026-01-06
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing long-context evaluation benchmarks struggle to balance scalability and realism: synthetic tasks lack real-world complexity, while human annotation is prohibitively expensive. To address this, the work introduces a benchmark of 1,500 naturally occurring bilingual (Chinese–English) long-text samples, spanning 11 main tasks and 25 subtasks, constructed via a human-in-the-loop pipeline that ensures high quality while improving efficiency. The study proposes a multidimensional taxonomy—based on context dependency, length, and difficulty—and designs task-specific metrics. Evaluation of 46 prominent models reveals that explicit long-context optimization outperforms mere parameter scaling; that actual effective context lengths consistently fall short of claimed values, with notable cross-lingual performance misalignment; and that hybrid reasoning paradigms achieve a better trade-off between performance and efficiency.

📝 Abstract
The rapid expansion of context length in large language models (LLMs) has outpaced existing evaluation benchmarks. Current long-context benchmarks often trade off scalability and realism: synthetic tasks underrepresent real-world complexity, while fully manual annotation is costly to scale to extreme lengths and diverse scenarios. We present LongBench Pro, a more realistic and comprehensive bilingual benchmark of 1,500 naturally occurring long-context samples in English and Chinese spanning 11 primary tasks and 25 secondary tasks, with input lengths from 8k to 256k tokens. LongBench Pro supports fine-grained analysis with task-specific metrics and a multi-dimensional taxonomy of context requirement (full vs. partial dependency), length (six levels), and difficulty (four levels calibrated by model performance). To balance quality with scalability, we propose a Human-Model Collaborative Construction pipeline: frontier LLMs draft challenging questions and reference answers, along with design rationales and solution processes, to reduce the cost of expert verification. Experts then rigorously validate correctness and refine problematic cases. Evaluating 46 widely used long-context LLMs on LongBench Pro yields three findings: (1) long-context optimization contributes more to long-context comprehension than parameter scaling; (2) effective context length is typically shorter than the claimed context length, with pronounced cross-lingual misalignment; and (3) the "thinking" paradigm helps primarily models trained with native reasoning, while mixed-thinking designs offer a promising Pareto trade-off. In summary, LongBench Pro provides a robust testbed for advancing long-context understanding.
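The abstract's multi-dimensional taxonomy can be illustrated with a small sketch of how a benchmark sample might be categorized. This is a hypothetical schema inferred from the description (field names, length boundaries, and difficulty labels are assumptions, not the paper's official format):

```python
from dataclasses import dataclass

# Illustrative taxonomy labels; the six length boundaries and four
# difficulty names below are assumptions, not taken from the paper.
DIFFICULTY_LEVELS = ["easy", "medium", "hard", "extreme"]

@dataclass
class Sample:
    language: str            # "en" or "zh" (bilingual benchmark)
    primary_task: str        # one of 11 primary tasks
    secondary_task: str      # one of 25 secondary tasks
    context_dependency: str  # "full" or "partial" context requirement
    length_level: int        # index into one of six length buckets
    difficulty: int          # index into DIFFICULTY_LEVELS
    n_tokens: int            # input length, 8k-256k per the abstract

def length_level(n_tokens: int) -> int:
    """Map a token count to one of six length buckets.

    Boundaries are illustrative guesses spanning the stated 8k-256k range.
    """
    bounds = [16_000, 32_000, 64_000, 128_000, 192_000]
    for i, bound in enumerate(bounds):
        if n_tokens < bound:
            return i
    return 5  # top bucket, up to 256k

# Example: a 45k-token partial-dependency Chinese QA sample
sample = Sample("zh", "QA", "multi-doc QA", "partial",
                length_level(45_000), difficulty=2, n_tokens=45_000)
print(sample.length_level)  # → 2
```

The point of such a scheme is that reported scores can be sliced along any axis (e.g. accuracy per length bucket per language), which is how the paper's effective-context-length and cross-lingual-misalignment findings would be surfaced.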
Problem

Research questions and friction points this paper is trying to address.

long-context evaluation
benchmark realism
bilingual LLMs
scalability
context length
Innovation

Methods, ideas, or system contributions that make the work stand out.

long-context evaluation
bilingual benchmark
human-model collaboration
fine-grained taxonomy
scalable annotation
Ziyang Chen
Peking University
Quantum key distribution, Quantum random number generation
Xing Wu
Institute of Information Engineering, Chinese Academy of Sciences
Junlong Jia
School of Artificial Intelligence, Beihang University
Chaochen Gao
Institute of Information Engineering, Chinese Academy of Sciences
NLP, Long-Context LLM
Qi Fu
Xiaohongshu Inc.
Debing Zhang
Xiaohongshu
Machine Learning, Computer Vision, Deep Learning
Songlin Hu
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences