Context Length Alone Hurts LLM Performance Despite Perfect Retrieval

📅 2025-10-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper identifies a previously overlooked long-context bottleneck: even under perfect retrieval—where irrelevant tokens are fully masked and no interference exists—LLM performance degrades significantly (by 13.9%–85%) as input length increases, irrespective of retrieval quality, indicating that input length itself imposes an intrinsic limitation. To address this, the authors propose a model-agnostic, prompt-based mitigation strategy: instructing the model to recite the key evidence before solving the problem, thereby explicitly transforming long-context tasks into short-context ones. The method is evaluated across mathematical reasoning, question answering, and code generation tasks using five open- and closed-source LLMs. Experiments employ evidence-prefacing, token masking, and the RULER benchmark for controlled assessment. Results confirm consistent attenuation of the length effect; notably, on RULER, the approach improves GPT-4o by up to 4% over an already strong baseline.

📝 Abstract
Large language models (LLMs) often fail to scale their performance on long-context tasks in line with the context lengths they support. This gap is commonly attributed to retrieval failures -- the models' inability to identify relevant information in long inputs. Accordingly, recent efforts often focus on evaluating and improving LLMs' retrieval performance: if retrieval is perfect, a model should, in principle, perform just as well on a long input as it does on a short one -- or should it? This paper presents findings that the answer to this question may be negative. Our systematic experiments across 5 open- and closed-source LLMs on math, question answering, and coding tasks reveal that, even when models can perfectly retrieve all relevant information, their performance still degrades substantially (13.9%--85%) as input length increases, while remaining well within the models' claimed lengths. This failure occurs even when the irrelevant tokens are replaced with minimally distracting whitespace, and, more surprisingly, when they are all masked and the models are forced to attend only to the relevant tokens. A similar performance drop is observed when all relevant evidence is placed immediately before the question. Our findings reveal a previously unrealized limitation: the sheer length of the input alone can hurt LLM performance, independent of retrieval quality and without any distraction. They motivate our simple, model-agnostic mitigation strategy that transforms a long-context task into a short-context one by prompting the model to recite the retrieved evidence before attempting to solve the problem. On RULER, we observe a consistent improvement of up to 4% for GPT-4o over an already strong baseline.
Problem

Research questions and friction points this paper is trying to address.

Long input length degrades LLM performance despite perfect retrieval
Performance drops occur even when irrelevant tokens are masked
Context length alone reduces model accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reciting retrieved evidence before problem solving
Transforming long-context tasks into short-context ones
Model-agnostic prompting strategy improves performance
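The recitation strategy described above can be sketched as a simple prompt wrapper. The function name and instruction wording below are illustrative assumptions, not the authors' exact prompt:

```python
# Hypothetical sketch of the paper's recitation idea: before answering,
# the model is instructed to restate the relevant evidence, turning a
# long-context task into an effectively short-context one. The wording
# here is illustrative, not taken from the paper.

def build_recitation_prompt(long_context: str, question: str) -> str:
    """Wrap a long context and a question in a two-step instruction:
    (1) recite the evidence relevant to the question verbatim,
    (2) answer using only the recited evidence."""
    return (
        f"{long_context}\n\n"
        f"Question: {question}\n\n"
        "Before answering, first quote, word for word, every sentence "
        "from the context above that is relevant to the question. "
        "Then answer the question using only the quoted evidence."
    )

# Toy usage; the technique matters at much longer lengths, where the
# paper reports 13.9%-85% degradation even under perfect retrieval.
context = "(... many irrelevant tokens ...) The launch code is 4821."
prompt = build_recitation_prompt(context, "What is the launch code?")
```

Because the wrapper only edits the prompt, it is model-agnostic and requires no access to weights or attention masks.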
Yufeng Du, University of Illinois at Urbana-Champaign
Minyang Tian, University of Illinois at Urbana-Champaign (AI4Science, Physics, large language models)
Srikanth Ronanki, Amazon (Speech Recognition, Natural Language Processing, Artificial Intelligence)
Subendhu Rongali, Amazon AGI (Natural Language Processing, Semantic Parsing, Low-Resource Language Understanding, Voice Assistants)
Sravan Bodapati, Amazon.com Inc.
Aram Galstyan, USC ISI & Amazon AGI (Machine Learning, NLU, Graphs)
Azton Wells, Argonne National Laboratory
Roy Schwartz, The Hebrew University of Jerusalem
Eliu A Huerta, Argonne National Laboratory, University of Chicago
Hao Peng, University of Illinois at Urbana-Champaign