In-Place Test-Time Training

📅 2026-04-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large language models are constrained by the static "train-then-deploy" paradigm, which limits their ability to adapt dynamically to new information during inference. This work proposes a plug-and-play test-time training framework that integrates test-time learning into mainstream LLM architectures. The method performs efficient, task-aligned parameter updates on the final projection matrix of MLP modules, guided by a theoretically motivated objective tailored to next-token prediction and enabled by a chunk-wise update mechanism compatible with context parallelism, achieving dynamic adaptation without full pretraining from scratch. As a lightweight enhancement, it enables a 4B-parameter model to excel on tasks with 128k-token contexts; a variant pretrained from scratch also significantly outperforms existing test-time training approaches.
📝 Abstract
The static "train then deploy" paradigm fundamentally limits Large Language Models (LLMs) from dynamically adapting their weights in response to the continuous streams of new information inherent in real-world tasks. Test-Time Training (TTT) offers a compelling alternative by updating a subset of model parameters (fast weights) at inference time, yet its potential in the current LLM ecosystem is hindered by critical barriers, including architectural incompatibility, computational inefficiency, and fast-weight objectives misaligned with language modeling. In this work, we introduce In-Place Test-Time Training (In-Place TTT), a framework that seamlessly endows LLMs with test-time training ability. In-Place TTT treats the final projection matrix of the ubiquitous MLP blocks as its adaptable fast weights, enabling a "drop-in" enhancement for LLMs without costly retraining from scratch. Furthermore, we replace TTT's generic reconstruction objective with a tailored, theoretically grounded objective explicitly aligned with the next-token-prediction task governing autoregressive language modeling. This principled objective, combined with an efficient chunk-wise update mechanism, yields a highly scalable algorithm compatible with context parallelism. Extensive experiments validate the framework's effectiveness: as an in-place enhancement, it enables a 4B-parameter model to achieve superior performance on tasks with contexts up to 128k tokens, and when pretrained from scratch, it consistently outperforms competitive TTT-related approaches. Ablation results provide further insight into our design choices. Collectively, these results establish In-Place TTT as a promising step toward a paradigm of continual learning in LLMs.
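The core mechanism described in the abstract, treating one projection matrix as fast weights and updating it chunk by chunk during inference, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the paper's NTP-aligned objective is not reproduced here, so a plain least-squares surrogate loss stands in for it, and the function name, dimensions, and learning rate are all hypothetical.

```python
import numpy as np

def chunkwise_fast_weight_update(W, hidden_states, targets, chunk_size=4, lr=0.1):
    """Sketch of chunk-wise test-time updates to a fast-weight matrix W.

    W             : (d_in, d_out) fast weights, e.g. an MLP down-projection.
    hidden_states : (T, d_in) per-token inputs to the projection.
    targets       : (T, d_out) supervision signal; here a generic regression
                    target stands in for the paper's NTP-aligned objective
                    (assumption for illustration).
    Within a chunk the weights stay frozen, so all tokens in the chunk can be
    processed in parallel; W is updated once per chunk.
    """
    W = W.copy()
    outputs = []
    for start in range(0, len(hidden_states), chunk_size):
        h = hidden_states[start:start + chunk_size]   # (c, d_in)
        y = targets[start:start + chunk_size]         # (c, d_out)
        pred = h @ W                                  # forward with current fast weights
        outputs.append(pred)
        grad = h.T @ (pred - y) / len(h)              # dL/dW for 0.5 * ||hW - y||^2
        W -= lr * grad                                # one gradient step per chunk
    return np.vstack(outputs), W

# Toy usage: targets come from a hidden "true" projection, so the fast
# weights should drift toward it as chunks stream in.
rng = np.random.default_rng(0)
H = rng.standard_normal((64, 8))
W_true = rng.standard_normal((8, 4))
Y = H @ W_true
W0 = np.zeros((8, 4))
preds, W_adapted = chunkwise_fast_weight_update(W0, H, Y)
```

The chunk-wise structure is what makes this kind of update amenable to context parallelism: the expensive per-token forward pass inside each chunk uses fixed weights, and only the cheap once-per-chunk update is sequential.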
Problem

Research questions and friction points this paper is trying to address.

Test-Time Training
Large Language Models
Continual Learning
Fast Weights
Inference-time Adaptation
Innovation

Methods, ideas, or system contributions that make the work stand out.

In-Place Test-Time Training
fast weights
next-token prediction
context parallelism
continual learning