🤖 AI Summary
Existing coding agents rely on fixed prompts and heuristic update rules, which limits how well they can select, abstract, and maintain reusable procedural skills. This work proposes CODESKILL, an LLM-based framework that treats skill extraction and skill-bank maintenance as a learnable management policy. It is trained with reinforcement learning to extract multi-granularity skills from coding-agent trajectories, using a hybrid reward that combines dense rubric-based skill-quality feedback with sparse verifiable execution feedback from a frozen downstream agent, so that the skill bank evolves with new experience while staying compact. On EnvBench, SWE-Bench Verified, and Terminal-Bench 2, the approach improves average pass rate by 9.69 over the no-skill baseline and by 4.01 over the strongest prompt-based or memory baseline, while keeping the skill bank at a stable size.
📝 Abstract
Coding agents produce rich trajectories while solving software-engineering tasks. To enable agent self-evolution, these trajectories can be distilled into reusable procedural skills that compactly encode experience to guide future behavior. However, existing skill construction and maintenance methods often rely on fixed prompts and heuristic update rules, leaving it unclear how knowledge should be selected, abstracted, and maintained to best serve downstream agents. We propose CODESKILL, an LLM-based framework that reformulates skill extraction and skill-bank maintenance as a learnable management policy. CODESKILL extracts multi-granularity procedural skills from coding-agent trajectories, evolves skills with new experience, and maintains a compact skill bank for future task solving. We train CODESKILL with reinforcement learning, using a hybrid reward that combines dense rubric-based skill-quality feedback with sparse verifiable execution feedback from the frozen downstream agent. Experiments on EnvBench, SWE-Bench Verified, and Terminal-Bench 2 show that CODESKILL improves average pass rate by 9.69 over the no-skill baseline and by 4.01 over the strongest prompt-based or memory baseline, while maintaining the skill bank at a stable size during iterative construction.
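The hybrid reward described above can be illustrated with a minimal sketch. The abstract does not specify how the dense and sparse signals are combined, so the weighted sum below, the `dense_weight` parameter, and the names `SkillFeedback` and `hybrid_reward` are all assumptions made for illustration, not the authors' formulation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SkillFeedback:
    # Dense signal: per-criterion rubric scores in [0, 1] (hypothetical rubric).
    rubric_scores: List[float]
    # Sparse signal: whether the frozen downstream agent passed the task
    # when given the skill; None if no execution result is available.
    task_passed: Optional[bool]


def hybrid_reward(fb: SkillFeedback, dense_weight: float = 0.5) -> float:
    """Combine dense and sparse feedback with an assumed weighted sum."""
    # Average the rubric scores; an empty rubric contributes nothing.
    dense = sum(fb.rubric_scores) / len(fb.rubric_scores) if fb.rubric_scores else 0.0
    if fb.task_passed is None:
        # No verifiable outcome: fall back to the dense term alone.
        return dense_weight * dense
    sparse = 1.0 if fb.task_passed else 0.0
    return dense_weight * dense + (1.0 - dense_weight) * sparse
```

For example, `hybrid_reward(SkillFeedback([1.0, 0.5], True))` averages the rubric scores to 0.75 and adds the pass signal, giving 0.875 under the assumed equal weighting. The dense term gives the policy a learning signal on every extracted skill, while the sparse term ties that signal to whether the skill actually helps the downstream agent.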