Improving Latent Generalization Using Test-time Compute

📅 2026-04-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Although language models encode vast knowledge in their weights, they struggle to reason deductively over that knowledge, which limits their generalization, as exemplified by the reversal curse. This work uses reinforcement learning to train models to spend test-time compute generating extended chains of thought, thereby activating and composing knowledge already stored in the weights. The approach improves out-of-distribution generalization without requiring task-specific data augmentation. Experiments show that it resolves many in-distribution reasoning failures and generalizes to knowledge for which no RL training was performed. While the model still cannot directly invert factual statements in reversal tasks, its generate-and-verify strategy performs far above random chance.
📝 Abstract
Language Models (LMs) exhibit two distinct mechanisms for knowledge acquisition: in-weights learning (i.e., encoding information within the model weights) and in-context learning (ICL). Although these two modes offer complementary strengths, in-weights learning frequently struggles to facilitate deductive reasoning over the internalized knowledge. We characterize this limitation as a deficit in latent generalization, of which the reversal curse is one example. Conversely, in-context learning demonstrates highly robust latent generalization capabilities. To improve latent generalization from in-weights knowledge, prior approaches rely on train-time data augmentation, yet these techniques are task-specific, scale poorly, and fail to generalize to out-of-distribution knowledge. To overcome these shortcomings, this work studies how models can be taught to use test-time compute, or 'thinking', specifically to improve latent generalization. We use Reinforcement Learning (RL) from correctness feedback to train models to produce long chains-of-thought (CoTs) to improve latent generalization. Our experiments show that this thinking approach not only resolves many instances of latent generalization failures on in-distribution knowledge but also, unlike augmentation baselines, generalizes to new knowledge for which no RL was performed. Nevertheless, on pure reversal tasks, we find that thinking does not unlock direct knowledge inversion, but the generate-and-verify ability of thinking models enables them to perform well above chance. The brittleness of factual self-verification means thinking models still remain well below the performance of in-context learning for this task. Overall, our results establish test-time thinking as a flexible and promising direction for improving the latent generalization of LMs.
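The generate-and-verify behavior the abstract attributes to thinking models can be illustrated with a toy sketch (this is not the paper's code; the knowledge base, entity names, and helper functions below are hypothetical). A model trained only on forward facts ("A's son is B") cannot directly answer the reversed query ("whose son is B?"), but it can propose candidate answers and check each one with a forward query it *can* perform:

```python
# Toy sketch of generate-and-verify for a reversal query.
# Assumption: the model answers reliably only in the forward direction,
# simulated here by a dictionary of forward facts.
FORWARD_FACTS = {
    "Mary Lee Pfeiffer": "Tom Cruise",  # "X's son is Y", stored forward only
    "Kathleen Turner": "Rachel Ann Weiss",
}

def forward_query(parent):
    """The direction the model can answer directly: parent -> child."""
    return FORWARD_FACTS.get(parent)

def reverse_via_generate_and_verify(child, candidates):
    """Answer 'whose son is <child>?' without any inverted fact:
    generate candidate parents, then verify each with a forward query."""
    for candidate in candidates:
        if forward_query(candidate) == child:  # self-verification step
            return candidate
    return None  # verification failed for every candidate

candidate_pool = list(FORWARD_FACTS) + ["Ada Lovelace"]  # includes a distractor
print(reverse_via_generate_and_verify("Tom Cruise", candidate_pool))
# -> Mary Lee Pfeiffer
```

The abstract's caveat maps onto the `forward_query(candidate) == child` check: if that self-verification step is brittle (as factual self-verification is for real LMs), the loop accepts wrong candidates or rejects right ones, which is why thinking models stay above chance yet below in-context learning on pure reversal.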
Problem

Research questions and friction points this paper is trying to address.

latent generalization
in-weights learning
deductive reasoning
reversal curse
language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

test-time compute
latent generalization
reinforcement learning
chain-of-thought
in-context learning
Arslan Chaudhry
DeepMind
Machine Learning · Artificial Intelligence
Sridhar Thiagarajan
Google DeepMind
Andrew Lampinen
Work done while at Google DeepMind