Continuous Self-Improvement of Large Language Models by Test-time Training with Verifier-Driven Sample Selection

📅 2025-05-26
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Large language models (LLMs) often underperform on unlabeled, out-of-distribution, and structurally novel reasoning tasks. To address this, we propose a verifier-driven test-time training (TTT) framework. Our method uses a lightweight verifier to score candidate responses and dynamically selects high-confidence pseudo-labeled samples for online, unsupervised adaptation. Adaptation updates only low-rank LoRA adapter parameters, so it needs neither full-parameter updates nor human annotations. This enables efficient, continuous, and resource-light model self-improvement. Compared to full-parameter TTT, our approach reduces computational overhead; unlike conventional verifier-based methods, it avoids static evaluation and manual labeling. Evaluated across three benchmarks and three state-of-the-art LLMs, our framework achieves up to a 32.29% relative improvement over the base model and a 6.66% gain over verifier-based methods without TTT, while converging quickly and requiring fewer resources.

๐Ÿ“ Abstract
Learning to adapt pretrained language models to unlabeled, out-of-distribution data is a critical challenge, as models often falter on structurally novel reasoning tasks even while excelling within their training distribution. We introduce VDS-TTT (Verifier-Driven Sample Selection for Test-Time Training), a new framework that addresses this challenge efficiently. We use a learned verifier to score a pool of generated responses and select only high-ranking pseudo-labeled examples for fine-tuning. Specifically, for each input query our LLM generates N candidate answers; the verifier assigns a reliability score to each, and the highest-scoring response, provided its score is above a fixed threshold, is paired with its query for test-time training. We fine-tune only low-rank LoRA adapter parameters, ensuring adaptation efficiency and fast convergence. Our proposed self-supervised framework is the first to synthesize verifier-driven test-time training data for continuous self-improvement of the model. Experiments across three diverse benchmarks and three state-of-the-art LLMs demonstrate that VDS-TTT yields up to a 32.29% relative improvement over the base model and a 6.66% gain compared to verifier-based methods without test-time training, highlighting its effectiveness and efficiency for on-the-fly large language model adaptation.
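The selection step described in the abstract (generate N candidates, score each with the verifier, keep the best one only if it is above a fixed threshold) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `select_example`, `build_ttt_set`, `toy_generate`, and `toy_verify`, the default values of `n_candidates` and `threshold`, and the strict "greater than" comparison are all assumptions made for the example. The toy generator and verifier stand in for a real LLM and a learned verifier.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class PseudoLabeledExample:
    """A (query, response) pair kept for test-time training."""
    query: str
    response: str
    score: float


def select_example(
    query: str,
    generate: Callable[[str, int], List[str]],
    verify: Callable[[str, str], float],
    n_candidates: int = 4,      # assumed value of N
    threshold: float = 0.9,     # assumed fixed threshold
) -> Optional[PseudoLabeledExample]:
    """Return the highest-scoring candidate if its score is above the threshold."""
    candidates = generate(query, n_candidates)
    if not candidates:
        return None
    # Score every candidate with the verifier and keep the best one.
    scored = [(verify(query, c), c) for c in candidates]
    best_score, best_response = max(scored, key=lambda pair: pair[0])
    # Assumption: "above" is read as strictly greater than the threshold.
    if best_score <= threshold:
        return None
    return PseudoLabeledExample(query, best_response, best_score)


def build_ttt_set(
    queries: List[str],
    generate: Callable[[str, int], List[str]],
    verify: Callable[[str, str], float],
    n_candidates: int = 4,
    threshold: float = 0.9,
) -> List[PseudoLabeledExample]:
    """Collect the pseudo-labeled pairs that pass verifier selection."""
    selected = []
    for query in queries:
        example = select_example(query, generate, verify, n_candidates, threshold)
        if example is not None:
            selected.append(example)
    return selected


def toy_generate(query: str, n: int) -> List[str]:
    """Hypothetical stand-in for sampling n answers from an LLM."""
    return [f"{query} -> answer {i}" for i in range(n)]


def toy_verify(query: str, response: str) -> float:
    """Hypothetical stand-in for a learned verifier; scores are made up."""
    return {"0": 0.2, "1": 0.5, "2": 0.95, "3": 0.7}[response[-1]]


if __name__ == "__main__":
    for ex in build_ttt_set(["q1", "q2"], toy_generate, toy_verify):
        print(ex.query, "|", ex.response, "|", ex.score)
```

Queries whose best candidate does not clear the threshold contribute no training pair, so the adapter is only updated on examples the verifier rates as reliable. The selected pairs would then be passed to a LoRA fine-tuning step, which is omitted here.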

Problem

Research questions and friction points this paper is trying to address.

Adapting pretrained LLMs to unlabeled out-of-distribution data
Improving model performance on novel reasoning tasks
Efficient self-supervised test-time training with verifier-driven selection

Innovation

Methods, ideas, or system contributions that make the work stand out.

Verifier-driven sample selection for training
Low-rank LoRA adapter fine-tuning
Self-supervised test-time training framework
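To make the efficiency argument for low-rank adapters concrete, the short calculation below compares trainable parameter counts. In LoRA, the update to a weight matrix of shape d_out × d_in is factored as B·A with B of shape d_out × r and A of shape r × d_in, so only r·(d_in + d_out) parameters are trained per adapted matrix. The layer size (4096 × 4096) and rank (8) are illustrative assumptions, not values reported for VDS-TTT, and the function names are hypothetical.

```python
def full_param_count(d_in: int, d_out: int) -> int:
    """Trainable parameters when the whole weight matrix is updated."""
    return d_in * d_out


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for a LoRA update B @ A of the given rank."""
    # B has shape (d_out, rank) and A has shape (rank, d_in).
    return rank * (d_in + d_out)


if __name__ == "__main__":
    d_in, d_out, rank = 4096, 4096, 8   # assumed, illustrative sizes
    full = full_param_count(d_in, d_out)
    lora = lora_param_count(d_in, d_out, rank)
    print(f"full update: {full:,} parameters")      # 16,777,216
    print(f"LoRA update: {lora:,} parameters")      # 65,536
    print(f"fraction trained: {lora / full:.4%}")   # 0.3906%
```

Under these assumed sizes the adapter trains 1/256 of the parameters of a full update to the same matrix, which is the kind of saving that makes repeated test-time updates practical.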