LV-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K

📅 2024-02-06
🏛️ arXiv.org
📈 Citations: 56
Influential: 9
🤖 AI Summary
Existing long-context LLM benchmarks suffer from insufficient average context lengths (5k–21k), potential knowledge leakage, and coarse-grained evaluation metrics, leading to biased performance assessments. To address these limitations, this work introduces LV-Eval, a challenging bilingual long-context benchmark spanning five length levels (16k–256k) and comprising 11 single-hop and multi-hop question-answering datasets. Its design incorporates three key techniques: (i) confusing facts insertion to create harder test instances, (ii) keyword and phrase replacement to mitigate knowledge leakage, and (iii) a keyword-recall-based metric for more objective, fine-grained scoring. Experiments on 15 mainstream LLMs reveal that Moonshot-v1 and large open-source models (e.g., Qwen-2.5-72B, Llama-3.1-70B) perform best, particularly at lengths below 64k; long-context-specialized models (e.g., GLM-4-9B-128k) degrade more gently with length but do not necessarily achieve higher absolute scores than shorter-context models; and the benchmark's design substantially alleviates the evaluation bias induced by knowledge leakage and inaccurate metrics.

📝 Abstract
State-of-the-art large language models (LLMs) are now claiming remarkable supported context lengths of 256k or even more. In contrast, the average context lengths of mainstream benchmarks are insufficient (5k-21k), and they suffer from potential knowledge leakage and inaccurate metrics, resulting in biased evaluation. This paper introduces LV-Eval, a challenging long-context benchmark with five length levels (16k, 32k, 64k, 128k, and 256k) reaching up to 256k words. LV-Eval features two main tasks, single-hop QA and multi-hop QA, comprising 11 bilingual datasets. The design of LV-Eval has incorporated three key techniques, namely confusing facts insertion, keyword and phrase replacement, and keyword-recall-based metric design. The advantages of LV-Eval include controllable evaluation across different context lengths, challenging test instances with confusing facts, mitigated knowledge leakage, and more objective evaluations. We evaluate 15 LLMs on LV-Eval and conduct ablation studies on the benchmarking techniques. The results reveal that: (i) Moonshot-v1 and recent large-scale open-source models, such as Qwen-2.5-72B and Llama-3.1-70B, achieve the highest performance on LV-Eval, particularly at lengths below 64k. (ii) Models exhibit distinct score trends. For example, GLM-4-9B-128k, Yi-6B-200k, and Llama3-8B-1M exhibit a relatively gentle degradation of performance, but their absolute performances may not necessarily be higher than those of LLMs with shorter context lengths. (iii) LLMs' performances can significantly degrade in the presence of confusing information, especially in the pressure test of "needle in a haystack". (iv) Issues related to knowledge leakage and inaccurate metrics introduce bias in evaluation, and these concerns are alleviated in LV-Eval. All datasets and evaluation codes are released at: https://github.com/infinigence/LVEval.
Problem

Research questions and friction points this paper is trying to address.

Existing benchmarks cannot adequately evaluate LLMs whose claimed context lengths reach 256k or more
Current benchmarks suffer from knowledge leakage and inaccurate evaluation metrics, biasing results
LV-Eval provides a balanced benchmark with controlled length levels and more objective metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Confusing facts insertion to create challenging test instances
Keyword and phrase replacement to mitigate knowledge leakage
Keyword-recall-based metric design for more objective evaluation
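The keyword-recall-based metric can be illustrated with a minimal sketch: the prediction is scored only if it recalls enough of the annotated answer keywords, which suppresses spuriously high F1 scores from answers that miss the key facts. This is a simplified illustration under assumed details, not the paper's exact formula; the function names and the 0.5 threshold are assumptions.

```python
from collections import Counter

def keyword_recall(answer_keywords, prediction):
    """Fraction of annotated gold keywords that appear in the prediction."""
    hits = sum(1 for kw in answer_keywords if kw.lower() in prediction.lower())
    return hits / len(answer_keywords)

def keyword_gated_f1(gold, prediction, answer_keywords, threshold=0.5):
    """Token-level F1 between gold answer and prediction, zeroed out
    when the prediction's keyword recall falls below the threshold."""
    if keyword_recall(answer_keywords, prediction) < threshold:
        return 0.0
    gold_toks = gold.lower().split()
    pred_toks = prediction.lower().split()
    overlap = sum((Counter(gold_toks) & Counter(pred_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```

Gating on keyword recall makes the score less forgiving of fluent but factually empty answers, which is the kind of bias a plain word-overlap F1 introduces.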