NeedleBench: Can LLMs Do Retrieval and Reasoning in Information-Dense Context?

📅 2024-07-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing long-context evaluation methods suffer from confounded model priors and reduced validity due to filler-based construction. To address this, we propose a bilingual, controllable synthetic benchmarking framework. Our method precisely embeds critical information points to construct two distinct scenarios, information-sparse and information-dense (the "Ancestral Trace Challenge"), enabling systematic assessment of retrieval and reasoning capabilities. We introduce an information-density-driven bimodal evaluation paradigm and identify and formalize the "under-thinking" phenomenon, supporting adaptive context-length adjustment and fine-grained capability attribution. Leveraging synthetic data generation, precise positional embedding control, and context-sensitivity analysis, we implement the framework within the OpenCompass pipeline. Experiments reveal significant performance degradation in state-of-the-art models, including DeepSeek-R1 and o3, under information-dense conditions, demonstrating the framework's high discriminability and diagnostic utility.

📝 Abstract
The capability of large language models to handle long-context information is crucial across various real-world applications. Existing evaluation methods often rely either on real-world long texts, making it difficult to exclude the influence of models' inherent knowledge, or introduce irrelevant filler content to artificially achieve target lengths, reducing assessment effectiveness. To address these limitations, we introduce NeedleBench, a synthetic framework for assessing retrieval and reasoning performance in bilingual long-context tasks with adaptive context lengths. NeedleBench systematically embeds key data points at varying depths to rigorously test model capabilities. Tasks are categorized into two scenarios: information-sparse, featuring minimal relevant details within extensive irrelevant text to simulate simple retrieval tasks; and information-dense (the Ancestral Trace Challenge), where relevant information is continuously distributed throughout the context to simulate complex reasoning tasks. Our experiments reveal that although recent reasoning models like DeepSeek-R1 and OpenAI's o3 excel in mathematical reasoning, they struggle with continuous retrieval and reasoning in information-dense scenarios, even at shorter context lengths. We also characterize a phenomenon termed 'under-thinking', where models prematurely conclude reasoning despite available information. NeedleBench thus provides critical insights and targeted tools essential for evaluating and improving LLMs' long-context capabilities. All resources are available at OpenCompass: https://github.com/open-compass/opencompass.
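To illustrate the information-dense scenario the abstract describes, here is a toy construction in the spirit of the Ancestral Trace Challenge: a chain of kinship facts that the model must reassemble end to end. This is a minimal sketch; the function, names, and fact phrasing are illustrative assumptions, not taken from the benchmark's actual data.

```python
import random

def make_ancestry_chain(people: list[str]) -> tuple[list[str], str]:
    """Build kinship facts linking people[i+1] as the parent of people[i].
    Every fact is needed to answer the question, so the context is
    information-dense rather than padded with irrelevant filler."""
    facts = [f"{people[i + 1]} is the parent of {people[i]}."
             for i in range(len(people) - 1)]
    random.shuffle(facts)  # shuffle so the model must reorder the chain itself
    question = f"Who is the earliest ancestor of {people[0]}?"
    return facts, question

facts, question = make_ancestry_chain(["Ann", "Ben", "Cara", "Dev"])
# The correct answer is the last name in the list ("Dev"), reachable
# only by chaining every fact in sequence.
```

Lengthening the list of people scales the number of required reasoning hops without adding any irrelevant text, which is what distinguishes this setting from filler-based length extension.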
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' retrieval and reasoning in long-context tasks
Assessing model performance in information-sparse and dense scenarios
Identifying 'under-thinking' phenomenon in continuous reasoning tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic framework for bilingual long-context tasks
Embedding key data points at varying depths
Testing models in sparse and dense scenarios
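The depth-controlled embedding listed above can be sketched as a small helper that splices "needles" into filler text at chosen fractional positions. This is a hypothetical sketch of the idea, not the paper's implementation; the function name and character-based depth measure are assumptions.

```python
def embed_needles(haystack: str, needles: list[str], depths: list[float]) -> str:
    """Insert each needle at a fractional depth of the filler text
    (0.0 = start, 1.0 = end), measured in characters."""
    assert len(needles) == len(depths)
    # Insert from deepest to shallowest so earlier insertions
    # do not shift the target positions of later ones.
    for needle, depth in sorted(zip(needles, depths), key=lambda p: -p[1]):
        pos = int(len(haystack) * depth)
        haystack = haystack[:pos] + needle + haystack[pos:]
    return haystack

context = embed_needles(
    "filler " * 1000,
    ["The secret code is 42. ", "Alice lives in Paris. "],
    [0.25, 0.75],
)
```

Sweeping `depths` over a grid while holding the haystack length fixed yields the depth-by-length performance heatmaps typical of needle-in-a-haystack evaluations.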