🤖 AI Summary
This work reframes long-context language modeling as a continual learning problem rather than one of architecture design, eschewing complex architectural modifications. Methodologically, it employs a standard sliding-window Transformer whose weights are updated at test time by gradient steps on next-token prediction over the given context, compressing long-range context into the weights as it is read. To make this adaptation effective, meta-learning at training time optimizes the initialization for test-time optimization, yielding a Test-Time Training (TTT) framework that is end-to-end at both test time (driven purely by next-token prediction) and training time (via meta-learning), in contrast to previous TTT variants. Experiments on a 3B-parameter model demonstrate that the approach scales with context length the same way a full-attention Transformer does while, like an RNN, keeping inference latency and memory footprint constant regardless of context length, making it 2.7× faster than full attention at 128K context.
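A minimal sketch of that test-time loop, assuming a causal `model` that maps token ids to logits; the chunking scheme, AdamW optimizer, and learning rate below are illustrative assumptions, not the authors' exact recipe:

```python
import torch
import torch.nn.functional as F

def ttt_adapt(model, context_ids, chunk_size=512, lr=1e-4):
    """Hypothetical test-time loop: one next-token-prediction gradient
    step per chunk, so the context read so far is compressed into the
    model's weights rather than held in a growing attention cache."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for start in range(0, context_ids.size(1) - 1, chunk_size):
        chunk = context_ids[:, start : start + chunk_size + 1]
        logits = model(chunk[:, :-1])  # (batch, time, vocab)
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            chunk[:, 1:].reshape(-1),
        )
        opt.zero_grad()
        loss.backward()
        opt.step()  # weights now carry what was read so far
    return model
```

Each step folds the chunk just read into the weights, so later predictions can draw on earlier context without attending to it; per-chunk work is fixed, which is why latency stays constant as the context grows.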
📝 Abstract
We formulate long-context language modeling as a problem in continual learning rather than architecture design. Under this formulation, we use only a standard architecture -- a Transformer with sliding-window attention. However, our model continues learning at test time via next-token prediction on the given context, compressing the context it reads into its weights. In addition, we improve the model's initialization for learning at test time via meta-learning at training time. Overall, our method, a form of Test-Time Training (TTT), is End-to-End (E2E) both at test time (via next-token prediction) and at training time (via meta-learning), in contrast to previous forms. We conduct extensive experiments with a focus on scaling properties. In particular, for 3B models trained with 164B tokens, our method (TTT-E2E) scales with context length in the same way as a Transformer with full attention, while others, such as Mamba 2 and Gated DeltaNet, do not. However, similar to RNNs, TTT-E2E has constant inference latency regardless of context length, making it 2.7 times faster than full attention at 128K context. Our code is publicly available.
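To illustrate the meta-learning side, here is a hedged sketch of one outer training step in the MAML style: a single differentiable inner next-token step on the context, then backpropagation of the loss on later tokens into the initialization. The single inner step, SGD-style inner update, and `inner_lr` are our assumptions; the paper's exact inner optimizer and step count may differ.

```python
import torch
import torch.nn.functional as F
from torch.func import functional_call

def lm_loss(model, params, ids):
    """Next-token cross-entropy, evaluated with an explicit parameter dict."""
    logits = functional_call(model, params, (ids[:, :-1],))
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), ids[:, 1:].reshape(-1)
    )

def meta_step(model, meta_opt, context_ids, future_ids, inner_lr=1e-4):
    """One outer step: simulate a test-time update on the context,
    then improve the initialization via the loss on later tokens."""
    params = dict(model.named_parameters())
    # Inner loop: one simulated TTT step, kept differentiable.
    inner_loss = lm_loss(model, params, context_ids)
    grads = torch.autograd.grad(
        inner_loss, list(params.values()), create_graph=True
    )
    adapted = {
        name: p - inner_lr * g
        for (name, p), g in zip(params.items(), grads)
    }
    # Outer loop: how well does the adapted model predict future tokens?
    outer_loss = lm_loss(model, adapted, future_ids)
    meta_opt.zero_grad()
    outer_loss.backward()
    meta_opt.step()
    return outer_loss.item()
```

`create_graph=True` makes the inner update differentiable, letting the outer loss shape the initialization for learning at test time; a first-order variant would drop it to save memory at some cost in fidelity.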