EvoTest: Evolutionary Test-Time Learning for Self-Improving Agentic Systems

📅 2025-10-15
🤖 AI Summary
Current AI agents struggle to acquire complex skills *during testing*, exhibiting “intelligent yet ignorant intern” behavior—severely limiting real-world deployment. To address this, we propose EvoTest, the first framework for gradient-free, parameter-efficient evolutionary test-time learning (ETTL). EvoTest employs two synergistic agents: an Actor Agent that executes tasks, and an Evolver Agent that analyzes interaction trajectories to iteratively optimize prompts, memory, hyperparameters, and tool-use strategies—*without gradient updates or parameter fine-tuning*. We introduce J-TTL, a novel benchmark evaluating agents’ ability to self-improve across multi-turn game episodes. Experiments demonstrate that EvoTest significantly outperforms state-of-the-art baselines—including reflection, memory augmentation, and online fine-tuning—achieving the only full completions of the highly challenging *Detective* and *Library* games in J-TTL. These results validate the efficacy and generalizability of autonomous evolutionary adaptation at test time.

📝 Abstract
A fundamental limitation of current AI agents is their inability to learn complex skills on the fly at test time, often behaving like "clever but clueless interns" in novel environments. This severely limits their practical utility. To systematically measure and drive progress on this challenge, we first introduce the Jericho Test-Time Learning (J-TTL) benchmark. J-TTL is a new evaluation setup where an agent must play the same game for several consecutive episodes, attempting to improve its performance from one episode to the next. On J-TTL, we find that existing adaptation methods like reflection, memory, or reinforcement learning struggle. To address the challenges posed by our benchmark, we present EvoTest, an evolutionary test-time learning framework that improves an agent without any fine-tuning or gradients, by evolving the entire agentic system after every episode. EvoTest has two roles: the Actor Agent, which plays the game, and the Evolver Agent, which analyzes the episode transcript to propose a revised configuration for the next run. This configuration rewrites the prompt, updates memory by logging effective state-action choices, tunes hyperparameters, and learns the tool-use routines. On our J-TTL benchmark, EvoTest consistently increases performance, outperforming not only reflection and memory-only baselines but also more complex online fine-tuning methods. Notably, our method is the only one capable of winning two games (Detective and Library), while all baselines fail to win any.
Problem

Research questions and friction points this paper is trying to address.

Developing agents that learn complex skills dynamically during test time
Overcoming limitations of reflection and memory in agent adaptation
Enabling self-improvement in novel environments without gradient fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evolutionary test-time learning without fine-tuning
Evolver Agent analyzes transcripts for system reconfiguration
Updates prompts, memory, hyperparameters, and tool routines
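
The Actor/Evolver loop described above can be sketched schematically. This is a minimal toy illustration, not the paper's implementation: the environment (`REWARDING`), the `Config` fields, and both agent functions are hypothetical stand-ins, with simple heuristics where EvoTest uses LLM-driven analysis of full game transcripts. It shows only the shape of the loop: the Actor plays an episode under a configuration, and the Evolver mines the transcript to rewrite the prompt, log effective state-action choices into memory, and tune a hyperparameter, with no gradient updates.

```python
import random
from dataclasses import dataclass, field

# Hypothetical toy stand-in for a text game: each state has one rewarding action.
REWARDING = {"hall": "go north", "study": "read note", "attic": "take key"}

@dataclass
class Config:
    """The whole agentic system that the Evolver rewrites between episodes."""
    prompt: str = "Play the game."
    memory: dict = field(default_factory=dict)   # state -> action that earned reward
    temperature: float = 1.0                      # exploration hyperparameter

def actor(config, rng):
    """Actor Agent: plays one episode under the current config."""
    transcript, score = [], 0
    for state in REWARDING:
        if state in config.memory:                # exploit logged state-action choices
            action = config.memory[state]
        else:                                     # otherwise explore
            action = rng.choice([REWARDING[state], "wait", "look around"])
        reward = int(action == REWARDING[state])
        score += reward
        transcript.append((state, action, reward))
    return score, transcript

def evolver(config, transcript):
    """Evolver Agent: mines the transcript and proposes a revised config."""
    memory = dict(config.memory)
    for state, action, reward in transcript:
        if reward:
            memory[state] = action                # log effective state-action pairs
    return Config(prompt=config.prompt + " Prefer moves that worked before.",
                  memory=memory,
                  temperature=max(0.1, 0.8 * config.temperature))  # anneal exploration

rng, cfg, scores = random.Random(0), Config(), []
for episode in range(5):                          # same game, consecutive episodes
    score, transcript = actor(cfg, rng)
    scores.append(score)
    cfg = evolver(cfg, transcript)                # evolve; no gradient updates
```

Because memory only ever accumulates rewarded state-action pairs, the per-episode score in this sketch is non-decreasing, mirroring the episode-over-episode self-improvement that J-TTL measures.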