SkillEvolBench: Benchmarking the Evolution from Episodic Experience to Procedural Skills

2026-05-22
Citations: 0
Influential: 0
AI Summary
It remains unclear whether large language model agents can distill reusable procedural skills from task experience. This work proposes SkillEvolBench, a benchmark of 180 role-conditioned tasks that systematically disentangles procedural abstraction from foundational capabilities, prior knowledge, and experience reuse. The framework evaluates skill evolution through trajectory compaction, verifier feedback, and deployment tests along several axes, including contextual transfer, adversarial shortcuts, and compositional generalization. Experiments reveal that current agents predominantly exhibit local adaptation, with distilled skills often underperforming direct reuse of the original trajectories. Moreover, merely increasing the number of skills or computational resources fails to yield consistent performance gains, highlighting fundamental challenges in the formation of procedural knowledge.
Abstract
Large language model (LLM) agents accumulate rich episodic trajectories while solving real-world tasks, but it remains unclear whether such experience can be distilled into reusable procedural skills. We introduce SkillEvolBench, a diagnostic benchmark for evaluating this step from experience reuse to skill formation. It contains 180 tasks across six real-world agent environments, organized into role-conditioned task families with shared latent procedures. Agents learn from acquisition tasks, update an external skill library using compacted trajectories and verifier feedback, and then face frozen deployment tasks testing context shift, adversarial shortcuts, and composition. By comparing self-generated and curated-start skill evolution against no-skill and raw-trajectory controls, SkillEvolBench separates procedural abstraction from base capability, curated prior knowledge, and direct reuse of episodic traces. Across ten model configurations and three agent harnesses, we find that current agents often adapt locally but rarely form robust reusable skills. Skill-based conditions can improve acquisition or replay, and individual models sometimes gain on specific deployment axes, but these gains are unstable under frozen deployment. Raw-trajectory reuse frequently outperforms distilled skills, suggesting that current abstraction procedures discard contextual and procedural cues that remain useful for future tasks. Capacity and cost analyses further show that writing more skills or larger Tier-3 resource libraries is not sufficient: additional updates can improve coverage while introducing episode-specific drift and procedural clutter. These findings position SkillEvolBench as a testbed for measuring when one-off experience becomes durable procedural knowledge rather than task-local memory.
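The protocol described in the abstract (acquisition, library update, frozen deployment, compared across four conditions) can be sketched roughly as follows. This is a minimal illustration and not the benchmark's actual code: the names `Task`, `Memory`, `run_condition`, `toy_solve`, and `toy_distill` are all hypothetical, and the toy solver is a stand-in for an LLM agent plus verifier, so its scores say nothing about the paper's results.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    name: str
    procedure: str  # hypothetical label for the shared latent procedure


@dataclass
class Memory:
    skills: List[str] = field(default_factory=list)        # distilled skill notes
    trajectories: List[str] = field(default_factory=list)  # raw episodic traces


def run_condition(condition, acquisition, deployment, solve, distill, curated=()):
    """Return the deployment success rate for one condition (illustrative only)."""
    # Only the curated-start condition begins with a non-empty skill library.
    memory = Memory(skills=list(curated) if condition == "curated_start" else [])

    # Acquisition phase: memory may be written after each task.
    for task in acquisition:
        success, trajectory = solve(task, memory)
        if condition == "raw_trajectory":
            memory.trajectories.append(trajectory)
        elif condition in ("self_generated", "curated_start"):
            memory.skills.append(distill(trajectory, success))
        # "no_skill": nothing is stored.

    # Deployment phase: memory is frozen, so a copy is read and never written.
    frozen = Memory(list(memory.skills), list(memory.trajectories))
    results = [solve(task, frozen)[0] for task in deployment]
    return sum(results) / len(results) if results else 0.0


def toy_solve(task, memory):
    """Toy stand-in for agent + verifier: succeeds if the procedure was seen."""
    seen = any(task.procedure in item for item in memory.skills + memory.trajectories)
    return seen, f"trace:{task.name}:{task.procedure}"


def toy_distill(trajectory, success):
    """Toy distillation: keep only the procedure label from the trace."""
    return trajectory.split(":")[-1]


acquisition = [Task("a1", "reconcile")]
deployment = [Task("d1", "reconcile"), Task("d2", "audit")]
scores = {
    cond: run_condition(cond, acquisition, deployment, toy_solve, toy_distill,
                        curated=("audit",))
    for cond in ("no_skill", "raw_trajectory", "self_generated", "curated_start")
}
```

The point of the sketch is the control structure: the no-skill and raw-trajectory conditions share the same loop as the two skill-evolution conditions, so differences in deployment score can be attributed to what is stored rather than to how tasks are run.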
Problem

Research questions and friction points this paper is trying to address.

procedural skills
episodic experience
skill distillation
experience reuse
LLM agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

procedural skill
episodic experience
skill distillation
agent benchmarking
trajectory compaction