Skills on the Fly: Test-Time Adaptive Skill Synthesis for LLM Agents

📅 2026-05-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limited adaptability of large language model (LLM) agents at test time, where tasks often need guidance more specific than a static skill library can provide. The authors propose SkillTTA, a context-based approach that synthesizes skills at test time without any parameter updates. SkillTTA retrieves a small set of training trajectories relevant to the current task, including failed ones, and synthesizes them into a temporary, task-specific textual skill that guides a fixed solver. Compared with static trajectory-to-skill synthesis, it raises Pass@1 from 0.397 to 0.505 on SpreadsheetBench and from 0.517 to 0.651 on BigCodeBench. On ALFWorld, it comes within four points of the success rate of a heavier memory-learning baseline while producing the shortest successful trajectories among the reported methods.

📝 Abstract

LLM agents benefit from reusable skills, yet test-time tasks often require guidance more specific than a static skill library can provide. We propose SkillTTA, a Test-Time Adaptive Skill Synthesis method that retrieves a small set of training trajectories relevant to the current task and synthesizes them into a temporary, task-specific textual skill. The solver model is kept fixed, so adaptation happens entirely through generated context rather than parameter updates. We evaluate the method on SpreadsheetBench, ALFWorld, and BigCodeBench. Compared with static trajectory-to-skill synthesis using GPT-5.5, task-specific skills improve SpreadsheetBench Pass@1 from 0.397 to 0.505 and BigCodeBench Pass@1 from 0.517 to 0.651. On ALFWorld, the method matches a heavier memory-learning baseline within four points of success rate while producing the shortest successful trajectories among reported methods. Ablations on SpreadsheetBench further show that synthesized skills outperform raw trajectory prompting, that top-k retrieval should stay small, and that failed trajectories are especially useful because they expose recurring evaluator-facing mistakes.
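The retrieve-then-synthesize loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Trajectory` record, the bag-of-words cosine retriever, the prompt wording, and the `synthesize` and `solve` callables (stand-ins for LLM calls) are all hypothetical names introduced here. The abstract does not specify how retrieval or synthesis is actually performed.

```python
from collections import Counter
from dataclasses import dataclass
from math import sqrt
from typing import Callable, List


@dataclass
class Trajectory:
    """Hypothetical record of one training-time attempt."""
    task: str      # task description the agent was given
    steps: str     # textual log of what the agent did
    success: bool  # whether the evaluator accepted the result


def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve_top_k(task: str, pool: List[Trajectory], k: int = 3) -> List[Trajectory]:
    """Return the k trajectories whose task text is most similar to `task`.

    Assumption: a toy lexical retriever stands in for whatever the paper uses.
    """
    query = Counter(task.lower().split())
    ranked = sorted(
        pool,
        key=lambda t: _cosine(query, Counter(t.task.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_synthesis_prompt(task: str, retrieved: List[Trajectory]) -> str:
    """Format retrieved trajectories, failures included, for the skill writer."""
    blocks = []
    for t in retrieved:
        label = "SUCCEEDED" if t.success else "FAILED"
        blocks.append(f"[{label}] task: {t.task}\nsteps: {t.steps}")
    examples = "\n\n".join(blocks)
    return (
        "Write a short, task-specific skill for the new task below.\n"
        "Reuse what worked and warn against the mistakes in failed attempts.\n\n"
        f"{examples}\n\nNew task: {task}"
    )


def solve_with_temporary_skill(
    task: str,
    pool: List[Trajectory],
    synthesize: Callable[[str], str],  # hypothetical LLM call: prompt -> skill text
    solve: Callable[[str], str],       # hypothetical fixed solver: prompt -> answer
    k: int = 3,
) -> str:
    """Adapt through context only: the solver's parameters are never touched."""
    retrieved = retrieve_top_k(task, pool, k)
    skill = synthesize(build_synthesis_prompt(task, retrieved))
    # The skill is temporary: it exists only inside this one solver prompt.
    return solve(f"{skill}\n\nTask: {task}")
```

Passing `synthesize` and `solve` in as plain callables keeps the sketch independent of any particular model API. The small default for `k` reflects the ablation finding that top-k retrieval should stay small.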

Problem

Research questions and friction points this paper is trying to address.

Test-Time Adaptation

Skill Synthesis

LLM Agents

Task-Specific Guidance

Static Skill Library

Innovation

Methods, ideas, or system contributions that make the work stand out.

Test-Time Adaptation

Skill Synthesis

LLM Agents