🤖 AI Summary
Existing language model fine-tuning relies on large-scale labeled datasets, which makes data collection inefficient, training expensive, and generalization uncertain. The problem is exacerbated by the absence of any per-sample information-gain assessment, which risks training on redundant examples. To address this, we propose a test-time self-improvement paradigm: the model detects uncertainty to identify challenging samples, dynamically generates semantically similar synthetic data, and performs immediate fine-tuning, establishing a closed loop of "self-awareness → self-augmentation → instant learning." This enables autonomous model evolution *during deployment*, eliminating the need for offline retraining. We study two complementary variants, TT-SI (self-augmentation) and TT-D (distillation from a stronger teacher); TT-SI achieves an average +5.48% absolute accuracy gain across multiple agent benchmarks while using only 1.47% (i.e., 1/68) of the training samples required by conventional methods, substantially improving both efficiency and generalization.
📝 Abstract
One paradigm of language model (LM) fine-tuning relies on creating large training datasets, under the assumption that high quantity and diversity will enable models to generalize to novel tasks after post-training. In practice, gathering large sets of data is inefficient, and training on them is prohibitively expensive; worse, there is no guarantee that the resulting model will handle complex scenarios or generalize better. Moreover, existing techniques rarely assess whether a training sample provides novel information or is redundant with the knowledge already acquired by the model, resulting in unnecessary costs. In this work, we explore a new test-time self-improvement method to create more effective and generalizable agentic LMs on-the-fly. The proposed algorithm can be summarized in three steps: (i) first it identifies the samples that the model struggles with (self-awareness), (ii) then it generates similar examples from the detected uncertain samples (self-data augmentation), and (iii) it uses these newly generated samples for test-time fine-tuning (self-improvement). We study two variants of this approach: Test-Time Self-Improvement (TT-SI), in which the same model generates additional training examples from its own uncertain cases and then learns from them, and Test-Time Distillation (TT-D), in which a stronger model generates similar examples for the uncertain cases, enabling the student to adapt using distilled supervision. Empirical evaluations across different agent benchmarks demonstrate that TT-SI improves performance by +5.48% absolute accuracy on average across all benchmarks and surpasses other standard learning methods, while using 68x fewer training samples. Our findings highlight the promise of TT-SI, demonstrating the potential of self-improvement algorithms at test time as a new paradigm for building more capable agents toward self-evolution.
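To make the three-step loop concrete, here is a minimal Python sketch of TT-SI at deployment time. The `model` methods (`confidence_gap`, `generate_variant`, `fine_tune`, `predict`), the `Sample` type, and the uncertainty threshold are illustrative assumptions, not the paper's actual interface.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

@dataclass
class Sample:
    prompt: str
    answer: Optional[str] = None  # test samples arrive unlabeled

def uncertainty(model, sample: Sample) -> float:
    # (i) Self-awareness: score how unsure the model is, e.g. via
    # predictive entropy or disagreement across sampled generations.
    # `confidence_gap` is an assumed helper, not a real API.
    return model.confidence_gap(sample.prompt)

def self_augment(model, sample: Sample, k: int = 4) -> List[Sample]:
    # (ii) Self-augmentation: the same model writes k semantically
    # similar labeled variants of the uncertain case (assumed helper).
    return [model.generate_variant(sample.prompt) for _ in range(k)]

def tt_si(model, stream: Iterable[Sample], threshold: float = 0.5):
    # Closed loop: self-awareness -> self-augmentation -> instant learning.
    for sample in stream:
        if uncertainty(model, sample) > threshold:
            # (iii) Immediate fine-tuning on the synthetic variants
            # before answering this sample.
            model.fine_tune(self_augment(model, sample))
        yield model.predict(sample.prompt)

# TT-D variant: have a stronger teacher model produce the variants in
# self_augment, so the student adapts on distilled supervision instead.
```

The design point the sketch illustrates is that adaptation is gated by uncertainty: confident samples pass through untouched, so only the small fraction of hard cases ever triggers data generation and a weight update.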