🤖 AI Summary
Long-context large language models face two key bottlenecks: (1) static self-attention induces "score dilution," impairing the model's ability to attend selectively to critical information; and (2) existing test-time strategies, such as chain-of-thought generation, suffer sharp performance degradation in long-horizon, multi-step reasoning. To address these, we propose a lightweight, context-aware gradient-update method applied at test time. We theoretically characterize, and show how to overcome, the fundamental limitation of static self-attention under long contexts for the first time. Further, we design a provably convergent test-time contextual training paradigm that replaces inefficient thought-token generation. Our approach integrates context-driven gradient updates, analytical diagnostics of the self-attention mechanism, and a controllable sandbox evaluation framework. On LongBench-v2 and ZeroScrolls, Qwen3-4B achieves average improvements of +12.6 and +14.1 percentage points, respectively, substantially outperforming existing test-time scaling methods.
📝 Abstract
Progress on training and architecture strategies has enabled LLMs with context lengths of millions of tokens. However, empirical evidence suggests that such long-context LLMs can consume far more text than they can reliably use. Meanwhile, it has been shown that inference-time compute can scale LLM performance on challenging tasks involving multi-step reasoning, often by generating thinking tokens. Through controlled experiments on sandbox long-context tasks, we find that such inference-time strategies show rapidly diminishing returns and fail at long context. We attribute these failures to score dilution, a phenomenon inherent to static self-attention. Further, we show that current inference-time strategies cannot retrieve relevant long-context signals under certain conditions. We propose a simple method that, through targeted gradient updates on the given context, provably overcomes the limitations of static self-attention. We find that this shift in how inference-time compute is spent leads to consistently large performance improvements across models and long-context benchmarks. Our method yields average improvements of 12.6 and 14.1 percentage points for Qwen3-4B across subsets of the LongBench-v2 and ZeroScrolls benchmarks. The takeaway is practical: for long context, a small amount of context-specific training is a better use of inference compute than current inference-time scaling strategies like producing more thinking tokens.
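To make the core idea concrete, the sketch below illustrates the general shape of test-time training on a given context: before answering, the model takes a few gradient steps of next-token prediction on the context itself, so that context-specific signal is moved into the weights rather than left to static attention. This is a minimal toy illustration, not the paper's implementation; the tiny model, hyperparameters, and random "context" are all illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab, dim = 50, 32

class TinyLM(nn.Module):
    """Toy stand-in for an LLM: embedding + linear next-token head."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.head = nn.Linear(dim, vocab)

    def forward(self, x):
        return self.head(self.emb(x))

model = TinyLM()
# Stand-in for the long input context the model must answer about.
context = torch.randint(0, vocab, (1, 64))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

def context_ntp_loss():
    # Next-token prediction loss on the given context.
    logits = model(context[:, :-1])
    return loss_fn(logits.reshape(-1, vocab), context[:, 1:].reshape(-1))

before = context_ntp_loss().item()
# A handful of test-time gradient steps on the context only.
for _ in range(20):
    opt.zero_grad()
    loss = context_ntp_loss()
    loss.backward()
    opt.step()
after = context_ntp_loss().item()
print(after < before)  # the model now fits its context better
```

In a realistic setting these updates would be applied to (a lightweight adaptation of) a pretrained LLM and discarded after the query is answered; the toy version only shows that a few gradient steps suffice to internalize context-specific structure.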