GRASP: Gated Regression-Aware Skill Proposer for Self-Improving LLM Agents

📅 2026-05-28
🤖 AI Summary
LLM agents often fail in structured environments because they lack procedural knowledge of the environment, and existing self-improvement methods can degrade capabilities the agent already had. The authors propose GRASP, which treats improvement as a sequence of edits to a bounded skill library and combines comparative skill proposal with an acceptance gate under a hard regression budget. A candidate edit is admitted only if it yields a net improvement on a balanced held-out probe without exceeding that budget. Frozen libraries transfer across models asymmetrically: skills from a stronger model help weaker executors more than the skills those executors learn for themselves, while the reverse does not hold. On MedAgentBench, GRASP raises gpt-oss-120b from 40.6% to 88.8%, 21.0 points above the strongest of five self-improvement baselines. It also improves agents on three of four non-clinical environments, staying flat only where the action space is open-ended.
📝 Abstract
LLM agents acting in structured environments fail in operational rather than conversational ways, and reliability depends on procedural knowledge of the environment. Prior self-improvement methods accumulate natural-language guidance without checking that each new item preserves previously correct behavior, so a note that fixes one trajectory can silently regress another. We introduce GRASP (Gated Regression-Aware Skill Proposer), which treats agent improvement as a sequence of edits to a bounded skill library, admitting each candidate only if it produces a net improvement on a balanced held-out probe under a hard regression budget. We evaluate GRASP across five base models (gpt-oss-120b, DeepSeek V4 Flash, Gemini 3.1 Flash Lite, GPT-4.1, GPT-5.4) on two FHIR-based clinical benchmarks. On MedAgentBench, GRASP lifts gpt-oss-120b from 40.6% to 88.8%, exceeds the strongest of five self-improvement baselines by 21.0 points, and improves every other base model by 17.2 to 40.3 points. Ablations attribute the gain to comparative proposal generation, the acceptance gate, and the hard regression budget rather than to skill writing itself, which without validation is no better than using no skills. The mechanism generalizes beyond the clinical domain, improving agents on three of four non-clinical environments and remaining flat only where the action space is open-ended. Frozen libraries transfer across models, where skills from a stronger model improve weaker executors beyond what they learn for themselves while the reverse does not, an asymmetry that no ungated baseline reproduces.
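The acceptance rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `gate`, `improve`, `solves`, `regression_budget`, and `max_skills` are hypothetical, the probe is reduced to a plain list of task identifiers, and "net improvement" is assumed to mean that fixed probe tasks outnumber regressed ones. How the paper balances the probe or generates proposals is not specified in the abstract and is left out.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical representation: a skill is a natural-language note,
# and a library is an immutable tuple of such notes.
Library = tuple[str, ...]
# Hypothetical evaluator: True if the agent solves `task` using `library`.
Evaluator = Callable[[Library, str], bool]


@dataclass(frozen=True)
class GateDecision:
    accepted: bool
    fixes: int        # probe tasks that failed before the edit and pass after it
    regressions: int  # probe tasks that passed before the edit and fail after it


def gate(current: Library, candidate: Library, probe: Sequence[str],
         solves: Evaluator, regression_budget: int = 0) -> GateDecision:
    """Admit `candidate` only if it is a net improvement within the budget."""
    before = [solves(current, task) for task in probe]
    after = [solves(candidate, task) for task in probe]
    fixes = sum(1 for b, a in zip(before, after) if not b and a)
    regressions = sum(1 for b, a in zip(before, after) if b and not a)
    # Hard budget: too many regressions reject the edit regardless of fixes.
    within_budget = regressions <= regression_budget
    # Assumed meaning of "net improvement": more fixes than regressions.
    net_gain = fixes > regressions
    return GateDecision(within_budget and net_gain, fixes, regressions)


def improve(library: Library, proposals: Sequence[str], probe: Sequence[str],
            solves: Evaluator, max_skills: int = 8,
            regression_budget: int = 0) -> Library:
    """Apply proposed skills one at a time, keeping only gated-in edits."""
    for skill in proposals:
        candidate = library + (skill,)
        if len(candidate) > max_skills:
            continue  # bounded library: skip edits that would exceed the cap
        if gate(library, candidate, probe, solves, regression_budget).accepted:
            library = candidate
    return library
```

In this sketch a proposal that fixes one probe task but breaks another is rejected even when the budget would tolerate the regression, because the fix and the regression cancel out. An ungated variant would simply append every proposal, which is the failure mode the abstract attributes to prior methods.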
Problem

Research questions and friction points this paper is trying to address.

LLM agents
self-improvement
regression
procedural knowledge
structured environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

regression-aware learning
skill library editing
gated self-improvement
LLM agents
hard regression budget