The Surprising Effectiveness of Test-Time Training for Abstract Reasoning

📅 2024-11-11
🏛️ arXiv.org
📈 Citations: 11
Influential: 1
📄 PDF
🤖 AI Summary
Language models exhibit limited out-of-distribution generalization on abstract reasoning tasks, such as those in the ARC benchmark, and adapt poorly from few-shot examples. To address this, the authors propose a test-time training (TTT) paradigm built on three core components: initial fine-tuning on similar tasks, auxiliary task design with augmentations, and per-instance optimization. The approach performs lightweight, input-driven parameter updates during inference, combining task-format modeling, transformation-based data augmentation, and model ensembling, and it extends purely neural TTT with program-generation approaches. On the public ARC validation set, an 8B-parameter model achieves 53.0% accuracy, surpassing the prior public, purely neural state of the art by 24.7 percentage points; ensembled with program generation, accuracy rises to 61.9%, matching average human performance for the first time. These results demonstrate TTT's effectiveness and scalability for abstract reasoning.

📝 Abstract
Language models have shown impressive performance on tasks within their training distribution, but often struggle with novel problems requiring complex reasoning. We investigate the effectiveness of test-time training (TTT) -- updating model parameters temporarily during inference using a loss derived from input data -- as a mechanism for improving models' reasoning capabilities, using the Abstraction and Reasoning Corpus (ARC) as a benchmark. Through systematic experimentation, we identify three crucial components for successful TTT: (1) initial finetuning on similar tasks, (2) auxiliary task format and augmentations, and (3) per-instance training. TTT significantly improves performance on ARC tasks, achieving up to 6x improvement in accuracy compared to base fine-tuned models; applying TTT to an 8B-parameter language model, we achieve 53% accuracy on the ARC's public validation set, improving the state-of-the-art by nearly 25% for public and purely neural approaches. By ensembling our method with recent program generation approaches, we get SoTA public validation accuracy of 61.9%, matching the average human score. Our findings suggest that explicit symbolic search is not the only path to improved abstract reasoning in neural language models; additional test-time compute applied to continued training on few-shot examples can also be extremely effective.
Problem

Research questions and friction points this paper is trying to address.

Improving few-shot learning for novel tasks
Enhancing language model adaptability via test-time training
Boosting reasoning capabilities with in-context examples
Innovation

Methods, ideas, or system contributions that make the work stand out.

Test-time training updates parameters during inference
Uses in-context examples for few-shot learning
Combines with program synthesis for human-level performance
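The augmentation-and-ensembling component listed above can be sketched for ARC-style grids: run the model on geometrically transformed views of the input, invert each transform on the output, and majority-vote. This is a simplified, hypothetical illustration assuming a rotation-equivariant task and a shape-preserving `predict` function; the paper's pipeline uses a richer set of transforms and voting.

```python
import numpy as np
from collections import Counter

def augmented_vote(predict, grid):
    """Test-time augmentation sketch: predict on each rotated view of a
    square grid, undo the rotation on the output, and majority-vote.
    `predict` is any grid -> grid function (stand-in for the model)."""
    candidates = []
    for k in range(4):                                   # 0/90/180/270 degrees
        view = np.rot90(grid, k)
        out = predict(view)
        candidates.append(np.rot90(out, -k).tobytes())   # invert the transform
    winner, _ = Counter(candidates).most_common(1)[0]
    return np.frombuffer(winner, dtype=grid.dtype).reshape(grid.shape)

# Toy equivariant "model": doubles every cell, so all views agree.
grid = np.array([[1, 2], [3, 4]])
print(augmented_vote(lambda g: g * 2, grid))
```

Voting across transformed views lets consistent predictions reinforce each other, which is one of the ingredients the paper credits for the accuracy gains.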