Library Drift: Diagnosing and Fixing a Silent Failure Mode in Self-Evolving LLM Skill Libraries

📅 2026-05-19

🤖 AI Summary

This study addresses "library drift", a degradation phenomenon in self-evolving large language model skill libraries in which unbounded skill accumulation without outcome-driven lifecycle management leads to retrieval degradation, false-positive skill injections, and performance stagnation. The work isolates and reproduces the failure through targeted ablations and introduces an outcome-driven skill lifecycle mechanism that combines outcome-driven retirement, a bounded cap on active skills, and a meta-skill authoring prior. Diagnosis rests on an append-only evidence log that records per-skill contribution scores, attribution verdicts, and router engagement metrics, making the failure visible before it reaches end-task scores. Over 100 evolution rounds on the MBPP+ hard-100 benchmark, the approach lifts held-out pass@1 from a 0.258 baseline to a late-window mean of 0.584 (reported rolling gain of +0.328).
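As a rough illustration of the diagnostic idea, the sketch below shows one way an append-only evidence log could feed a per-skill contribution score. It is not the paper's implementation: the `Evidence` fields, the `EvidenceLog` class, and the scoring rule (mean pass/fail difference between a run with the skill injected and a control run without it) are all assumptions made for this example.

```python
import json
from collections import defaultdict
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Evidence:
    """One observation of a skill on a task (hypothetical schema)."""
    round_idx: int
    task_id: str
    skill_id: str
    passed_with: bool     # task passed with the skill injected
    passed_without: bool  # task passed in a control run without the skill


class EvidenceLog:
    """Append-only log: records are serialized once and never mutated."""

    def __init__(self):
        self._lines = []

    def append(self, ev):
        self._lines.append(json.dumps(asdict(ev)))

    def records(self):
        return [Evidence(**json.loads(line)) for line in self._lines]


def contribution_scores(log):
    """Mean (with - without) pass delta per skill, in [-1.0, 1.0]."""
    totals = defaultdict(lambda: [0, 0])  # skill_id -> [delta sum, count]
    for ev in log.records():
        entry = totals[ev.skill_id]
        entry[0] += int(ev.passed_with) - int(ev.passed_without)
        entry[1] += 1
    return {skill: delta / count for skill, (delta, count) in totals.items()}
```

Under this assumed scoring rule, a skill whose score stays near zero across many rounds adds nothing measurable, and a negative score flags a skill that hurts the tasks it is injected into.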

📝 Abstract

Self-evolving skill libraries face a silent failure mode we term library drift: unbounded skill accumulation without outcome-driven lifecycle management causes retrieval degradation, false-positive injections, and performance stagnation. Recent evaluation confirms the symptom (LLM-authored skills deliver +0.0pp gain while human-curated ones deliver +16.2pp on SkillsBench), yet the underlying mechanism has not been isolated. We provide (1) a reproducible trigger: ablations that isolate drift, where one disables skill injection (flat floor, +0.002) and one imposes premature retirement (active harm, -0.019); (2) trace-level diagnostics: an append-only evidence log with per-skill contribution scores, attribution verdicts, and router engagement metrics that make the failure visible before it reaches end-task scores; and (3) a verified fix: a minimal governance recipe (outcome-driven retirement + bounded active-cap + meta-skill authoring prior) that lifts held-out pass@1 from a 0.258 baseline to a late-window mean of 0.584 (rolling gain +0.328) on MBPP+ hard-100 over 100 rounds. Eight ablations decompose which governance mechanisms are load-bearing and which are subsumed, providing a concrete playbook for diagnosing library drift in any self-evolving agent.
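To make two parts of the governance recipe concrete, here is a minimal sketch of outcome-driven retirement combined with a bounded active cap. It is an illustration under stated assumptions, not the authors' code: the `Skill` fields, the `govern` function, and the values of `active_cap`, `min_uses`, and `retire_below` are hypothetical, and the meta-skill authoring prior is left out entirely.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    """Hypothetical skill record carrying its running evidence."""
    skill_id: str
    score: float  # running contribution score from observed outcomes
    uses: int     # number of times the skill has been injected
    active: bool = True


def govern(skills, active_cap=3, min_uses=5, retire_below=0.0):
    """Apply one governance pass and return the ids that remain active.

    Step 1 (outcome-driven retirement): retire a skill only once it has at
    least `min_uses` observations and its score is at or below
    `retire_below`. The evidence threshold is meant to avoid retiring
    skills prematurely.
    Step 2 (bounded active cap): keep at most `active_cap` skills active,
    preferring the highest scores.
    """
    for skill in skills:
        if skill.active and skill.uses >= min_uses and skill.score <= retire_below:
            skill.active = False

    ranked = sorted(
        (s for s in skills if s.active), key=lambda s: s.score, reverse=True
    )
    for skill in ranked[active_cap:]:
        skill.active = False
    return [s.skill_id for s in ranked[:active_cap]]
```

The `min_uses` guard reflects the abstract's observation that premature retirement is actively harmful; how much evidence is enough before retiring a skill is a tuning choice this sketch does not settle.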

Problem

Research questions and friction points this paper is trying to address.

library drift

self-evolving LLM

skill libraries

retrieval degradation

performance stagnation

Innovation

Methods, ideas, or system contributions that make the work stand out.

library drift

self-evolving LLMs

skill lifecycle management