AI Summary
This work addresses the challenge that large language model (LLM) agents struggle to effectively accumulate and reuse experience, as existing approaches extract only flat textual knowledge, failing to capture the procedural logic of complex subtasks and lacking robust mechanisms for knowledge base maintenance. To overcome this, we propose AutoRefine, a framework that, for the first time, enables dual extraction and co-maintenance of procedural and static experience: the former is embodied in specialized sub-agents with independent reasoning and memory capabilities, while the latter is distilled into reusable skill patterns, such as guidelines or code snippets. A continuous scoring, pruning, and merging mechanism prevents experience degradation over time. Experiments demonstrate that AutoRefine achieves success rates of 98.4%, 70.4%, and 27.1% on ALFWorld, ScienceWorld, and TravelPlanner, respectively, reducing action steps by 20-73% and substantially outperforming handcrafted systems on TravelPlanner (27.1% vs. 12.1%).
Abstract
Large language model agents often fail to accumulate knowledge from experience, treating each task as an independent challenge. Recent methods extract experience as flattened textual knowledge, which cannot capture the procedural logic of complex subtasks. They also lack maintenance mechanisms, causing repository degradation as experience accumulates. We introduce AutoRefine, a framework that extracts and maintains dual-form Experience Patterns from agent execution histories. For procedural subtasks, we extract specialized subagents with independent reasoning and memory. For static knowledge, we extract skill patterns as guidelines or code snippets. A continuous maintenance mechanism scores, prunes, and merges patterns to prevent repository degradation. Evaluated on ALFWorld, ScienceWorld, and TravelPlanner, AutoRefine achieves success rates of 98.4%, 70.4%, and 27.1%, respectively, with 20-73% step reductions. On TravelPlanner, automatic extraction exceeds manually designed systems (27.1% vs. 12.1%), demonstrating its ability to capture procedural coordination.
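To make the dual-form repository and its maintenance loop concrete, here is a minimal Python sketch. It is an illustration only, not the paper's implementation: the `ExperiencePattern` dataclass, the success-rate scoring rule, and the name-based merge key are all assumptions introduced for this example; the paper specifies only that patterns are continuously scored, pruned, and merged.

```python
from dataclasses import dataclass

@dataclass
class ExperiencePattern:
    """One entry in the experience repository (hypothetical schema).

    kind is "subagent" for procedural patterns (a sub-agent with its own
    reasoning/memory prompt) or "skill" for static patterns (a guideline
    or code snippet).
    """
    name: str
    kind: str
    content: str
    uses: int = 0
    successes: int = 0

    def score(self) -> float:
        # Assumed scoring rule: smoothed success rate, so patterns with
        # few observations are not pruned immediately.
        return (self.successes + 1) / (self.uses + 2)

def maintain(repo, prune_below=0.3):
    """One maintenance pass: prune low-scoring patterns, merge duplicates."""
    # 1. Prune patterns whose score has degraded below the threshold.
    kept = [p for p in repo if p.score() >= prune_below]
    # 2. Merge duplicates (here: same case-insensitive name), pooling
    #    their usage statistics so the merged pattern keeps its history.
    merged = {}
    for p in kept:
        key = p.name.lower()
        if key in merged:
            merged[key].uses += p.uses
            merged[key].successes += p.successes
        else:
            merged[key] = p
    return list(merged.values())

repo = [
    ExperiencePattern("Navigate", "subagent", "go-to-receptacle routine", 10, 9),
    ExperiencePattern("navigate", "subagent", "duplicate extraction", 4, 3),
    ExperiencePattern("BadTip", "skill", "misleading guideline", 8, 1),
]
repo = maintain(repo)  # BadTip pruned (score 0.2); the two Navigate entries merged
```

The smoothing prior in `score` and the exact merge criterion (string equality vs. semantic similarity) are design choices; a real system would likely use an LLM or embedding similarity to decide which patterns to merge.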