🤖 AI Summary
This work reframes long-context language modeling as a continual learning problem rather than one of architecture design, eschewing complex architectural modifications. Methodologically, it employs a standard sliding-window Transformer whose weights are updated at test time by gradient steps on next-token prediction over the given context, compressing long-range context into the weights as it is read. To make this adaptation effective, meta-learning at training time optimizes the initialization for test-time optimization, yielding a Test-Time Training (TTT) framework that is end-to-end at both test time (driven purely by next-token prediction) and training time (via meta-learning), in contrast to previous TTT variants. Experiments on a 3B-parameter model demonstrate that the approach scales with context length the same way a full-attention Transformer does while, like an RNN, keeping inference latency and memory footprint constant regardless of context length, making it 2.7× faster than full attention at 128K context.
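A minimal sketch of that test-time loop, assuming a causal `model` that maps token ids to logits; the chunking scheme, AdamW optimizer, and learning rate below are illustrative assumptions, not the authors' exact recipe:

```python
import torch
import torch.nn.functional as F

def ttt_adapt(model, context_ids, chunk_size=512, lr=1e-4):
    """Hypothetical test-time loop: one next-token-prediction gradient
    step per chunk, so the context read so far is compressed into the
    model's weights rather than held in a growing attention cache."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for start in range(0, context_ids.size(1) - 1, chunk_size):
        chunk = context_ids[:, start : start + chunk_size + 1]
        logits = model(chunk[:, :-1])  # (batch, time, vocab)
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            chunk[:, 1:].reshape(-1),
        )
        opt.zero_grad()
        loss.backward()
        opt.step()  # weights now carry what was read so far
    return model
```

Each step folds the chunk just read into the weights, so later predictions can draw on earlier context without attending to it; per-chunk work is fixed, which is why latency stays constant as the context grows.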
📝 Abstract
We formulate long-context language modeling as a problem in continual learning rather than architecture design. Under this formulation, we use only a standard architecture -- a Transformer with sliding-window attention. However, our model continues learning at test time via next-token prediction on the given context, compressing the context it reads into its weights. In addition, we improve the model's initialization for learning at test time via meta-learning at training time. Overall, our method, a form of Test-Time Training (TTT), is End-to-End (E2E) both at test time (via next-token prediction) and at training time (via meta-learning), in contrast to previous forms. We conduct extensive experiments with a focus on scaling properties. In particular, for 3B models trained with 164B tokens, our method (TTT-E2E) scales with context length in the same way as a Transformer with full attention, while others, such as Mamba 2 and Gated DeltaNet, do not. However, similar to RNNs, TTT-E2E has constant inference latency regardless of context length, making it 2.7 times faster than full attention at 128K context. Our code is publicly available.
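To illustrate the meta-learning side, here is a hedged sketch of one outer training step in the MAML style: a single differentiable inner next-token step on the context, then backpropagation of the loss on later tokens into the initialization. The single inner step, SGD-style inner update, and `inner_lr` are our assumptions; the paper's exact inner optimizer and step count may differ.

```python
import torch
import torch.nn.functional as F
from torch.func import functional_call

def lm_loss(model, params, ids):
    """Next-token cross-entropy, evaluated with an explicit parameter dict."""
    logits = functional_call(model, params, (ids[:, :-1],))
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), ids[:, 1:].reshape(-1)
    )

def meta_step(model, meta_opt, context_ids, future_ids, inner_lr=1e-4):
    """One outer step: simulate a test-time update on the context,
    then improve the initialization via the loss on later tokens."""
    params = dict(model.named_parameters())
    # Inner loop: one simulated TTT step, kept differentiable.
    inner_loss = lm_loss(model, params, context_ids)
    grads = torch.autograd.grad(
        inner_loss, list(params.values()), create_graph=True
    )
    adapted = {
        name: p - inner_lr * g
        for (name, p), g in zip(params.items(), grads)
    }
    # Outer loop: how well does the adapted model predict future tokens?
    outer_loss = lm_loss(model, adapted, future_ids)
    meta_opt.zero_grad()
    outer_loss.backward()
    meta_opt.step()
    return outer_loss.item()
```

`create_graph=True` makes the inner update differentiable, letting the outer loss shape the initialization for learning at test time; a first-order variant would drop it to save memory at some cost in fidelity.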