🤖 AI Summary
Existing long-context LLM benchmarks suffer from insufficient average context lengths (5K–21K), knowledge leakage, and coarse-grained evaluation metrics, leading to distorted performance assessments. To address these limitations, this work introduces LV-Eval, a bilingual long-context benchmark spanning five length levels (16K–256K) and comprising 11 single-hop and multi-hop question-answering datasets. Its design incorporates three key techniques: (i) confusing-fact insertion to create more challenging test instances, (ii) keyword and phrase replacement to mitigate knowledge leakage, and (iii) keyword-recall-based metrics for more objective, fine-grained evaluation. Experiments on 15 mainstream LLMs reveal that large models (e.g., Qwen-2.5-72B) achieve the highest scores, particularly at context lengths below 64K; long-context-specialized models (e.g., GLM-4-9B-128K) exhibit gentler performance decay but do not necessarily outperform generalist models in absolute terms; and the benchmark's design substantially alleviates the evaluation bias induced by knowledge leakage.
📝 Abstract
State-of-the-art large language models (LLMs) are now claiming remarkable supported context lengths of 256k or even more. In contrast, the average context lengths of mainstream benchmarks are insufficient (5k-21k), and they suffer from potential knowledge leakage and inaccurate metrics, resulting in biased evaluation. This paper introduces LV-Eval, a challenging long-context benchmark with five length levels (16k, 32k, 64k, 128k, and 256k) reaching up to 256k words. LV-Eval features two main tasks, single-hop QA and multi-hop QA, comprising 11 bilingual datasets. The design of LV-Eval incorporates three key techniques, namely confusing facts insertion, keyword and phrase replacement, and keyword-recall-based metric design. The advantages of LV-Eval include controllable evaluation across different context lengths, challenging test instances with confusing facts, mitigated knowledge leakage, and more objective evaluation. We evaluate 15 LLMs on LV-Eval and conduct ablation studies on the benchmarking techniques. The results reveal that: (i) Moonshot-v1 and recent large-scale open-source models, such as Qwen-2.5-72B and Llama-3.1-70B, achieve the highest performance on LV-Eval, particularly at lengths below 64k. (ii) Models exhibit distinct score trends. For example, GLM-4-9B-128k, Yi-6B-200k, and Llama3-8B-1M exhibit a relatively gentle degradation of performance, but their absolute performance is not necessarily higher than that of LLMs with shorter context lengths. (iii) LLMs' performance can degrade significantly in the presence of confusing information, especially in the "needle in a haystack" pressure test. (iv) Issues related to knowledge leakage and inaccurate metrics introduce bias in evaluation, and these concerns are alleviated in LV-Eval. All datasets and evaluation code are released at: https://github.com/infinigence/LVEval.
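To make the keyword-recall-based metric idea concrete, here is a minimal sketch of how such a score could be computed. This is an illustrative assumption, not LV-Eval's actual implementation: the function name `keyword_recall`, the case-insensitive substring matching, and the example keywords are all hypothetical, and the real benchmark may use more elaborate matching and weighting.

```python
def keyword_recall(prediction: str, keywords: list[str]) -> float:
    """Fraction of annotated answer keywords that appear in the prediction.

    A recall-over-keywords score rewards answers that contain the gold
    facts, rather than penalizing verbose but correct responses the way
    full-string overlap metrics can.
    """
    if not keywords:
        return 0.0
    pred = prediction.lower()
    hits = sum(1 for kw in keywords if kw.lower() in pred)
    return hits / len(keywords)

# A verbose answer still scores 1.0 as long as every keyword is present.
score = keyword_recall(
    "The treaty was signed in Paris in 1898 after long negotiations.",
    ["Paris", "1898"],
)
```

The design intuition is that fine-grained credit per keyword gives a smoother, more objective signal than binary exact-match, which matters when answers are drawn from very long contexts.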