🤖 AI Summary
Existing long-context evaluation benchmarks struggle to balance scalability and realism: synthetic tasks lack real-world complexity, while human annotation is prohibitively expensive. To address this, this work introduces LongBench Pro, a benchmark comprising 1,500 naturally occurring bilingual (Chinese–English) long-text samples, spanning 11 main tasks and 25 subtasks, constructed via a human-in-the-loop pipeline that ensures high quality while improving efficiency. The study proposes a multidimensional taxonomy—based on context dependency, length, and difficulty—and designs task-specific metrics. Evaluation of 46 prominent models reveals that explicit long-context optimization outperforms mere parameter scaling; that effective context lengths consistently fall short of claimed values, with notable cross-lingual performance misalignment; and that hybrid reasoning paradigms achieve a better trade-off between performance and efficiency.
📝 Abstract
The rapid expansion of context length in large language models (LLMs) has outpaced existing evaluation benchmarks. Current long-context benchmarks often trade off scalability against realism: synthetic tasks underrepresent real-world complexity, while fully manual annotation is costly to scale to extreme lengths and diverse scenarios. We present LongBench Pro, a more realistic and comprehensive bilingual benchmark of 1,500 naturally occurring long-context samples in English and Chinese spanning 11 primary tasks and 25 secondary tasks, with input lengths from 8k to 256k tokens. LongBench Pro supports fine-grained analysis with task-specific metrics and a multi-dimensional taxonomy of context requirement (full vs. partial dependency), length (six levels), and difficulty (four levels calibrated by model performance). To balance quality with scalability, we propose a Human-Model Collaborative Construction pipeline: frontier LLMs draft challenging questions and reference answers, along with design rationales and solution processes, to reduce the cost of expert verification. Experts then rigorously validate correctness and refine problematic cases. Evaluating 46 widely used long-context LLMs on LongBench Pro yields three findings: (1) long-context optimization contributes more to long-context comprehension than parameter scaling; (2) effective context length is typically shorter than the claimed context length, with pronounced cross-lingual misalignment; and (3) the "thinking" paradigm helps primarily models trained with native reasoning, while mixed-thinking designs offer a promising Pareto trade-off. In summary, LongBench Pro provides a robust testbed for advancing long-context understanding.