Continuous Self-Improvement of Large Language Models by Test-time Training with Verifier-Driven Sample Selection

📅 2025-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) often underperform on unlabeled, out-of-distribution, and structurally novel reasoning tasks. To address this, we propose a verifier-driven test-time training (TTT) framework. Our method employs a lightweight verifier to score candidate responses, dynamically selecting high-confidence pseudo-labeled samples for online, unsupervised adaptation via low-rank LoRA fine-tuning alone, eliminating the need for full-parameter updates or human annotations. This enables efficient, continuous, and resource-light model self-improvement. Compared to full-parameter TTT, our approach significantly reduces computational overhead; unlike conventional verifier-based methods, it avoids static evaluation and manual labeling. Evaluated across three benchmarks and three state-of-the-art LLMs, our framework achieves up to a 32.29% relative improvement over baselines and a 6.66% gain over verifier-only methods without TTT, while converging faster and requiring fewer resources.

📝 Abstract
Learning to adapt pretrained language models to unlabeled, out-of-distribution data is a critical challenge, as models often falter on structurally novel reasoning tasks even while excelling within their training distribution. We introduce VDS-TTT (Verifier-Driven Sample Selection for Test-Time Training), a new framework that addresses this efficiently. We use a learned verifier to score a pool of generated responses and select only high-ranking pseudo-labeled examples for fine-tuning-based adaptation. Specifically, for each input query the LLM generates N candidate answers; the verifier assigns a reliability score to each, and the highest-confidence response that exceeds a fixed threshold is paired with its query for test-time training. We fine-tune only low-rank LoRA adapter parameters, ensuring adaptation efficiency and fast convergence. Our proposed self-supervised framework is the first to synthesize verifier-driven test-time training data for continuous self-improvement of the model. Experiments across three diverse benchmarks and three state-of-the-art LLMs demonstrate that VDS-TTT yields up to a 32.29% relative improvement over the base model and a 6.66% gain compared to verifier-based methods without test-time training, highlighting its effectiveness and efficiency for on-the-fly large language model adaptation.
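The selection step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`select_pseudo_label`, `toy_verifier`), the threshold variable `tau`, and the toy scoring rule are all assumptions introduced here, and the actual LoRA fine-tuning step is omitted.

```python
def select_pseudo_label(query, candidates, verifier_score, tau=0.9):
    """Score each candidate answer with the verifier and return the single
    highest-confidence (query, answer) pair if its score clears the fixed
    threshold tau; otherwise return None and the query is skipped for
    test-time training."""
    scored = [(verifier_score(query, c), c) for c in candidates]
    best_score, best_answer = max(scored, key=lambda sc: sc[0])
    if best_score >= tau:
        # This pair would feed one step of LoRA-adapter fine-tuning.
        return (query, best_answer)
    return None

# Illustrative stand-in for a learned verifier: favors answers that
# contain the query string. A real verifier would be a trained model.
def toy_verifier(query, answer):
    return 1.0 if query in answer else 0.1

pair = select_pseudo_label("42", ["the answer is 42", "unsure"],
                           toy_verifier, tau=0.5)
```

Selected pairs accumulate into an online training stream, so adaptation happens continuously without any human labels.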
Problem

Research questions and friction points this paper is trying to address.

Adapting pretrained LLMs to unlabeled out-of-distribution data
Improving model performance on novel reasoning tasks
Efficient self-supervised test-time training with verifier-driven selection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Verifier-driven sample selection for training
Low-rank LoRA adapter fine-tuning
Self-supervised test-time training framework
Mohammad Mahdi Moradi
Department of Computer Science, Concordia University, Ascend Team, Huawei Technologies
Hossam Amer
Ascend Team, Toronto Research Center, Huawei Technologies
Sudhir Mudur
Professor of Computer Science, Concordia University
Computer Graphics
Weiwei Zhang
Ascend Team, Toronto Research Center, Huawei Technologies
Yang Liu
Ascend Team, Toronto Research Center, Huawei Technologies
Walid Ahmed
Huawei Technologies Canada
Deep Learning · Machine Learning · Soft Computing