AgentDevel: Reframing Self-Evolving LLM Agents as Release Engineering

📅 2026-01-08
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the instability, poor auditability, and lack of regression control commonly observed in existing self-evolution methods for large language model (LLM) agents. The authors reframe agent evolution as a release engineering problem and propose a regression-aware, externalized release pipeline. By decoupling fault diagnosis from implementation and incorporating a behavior-flipping gating mechanism, the approach extracts symptom-level quality signals from execution traces to generate auditable engineering specifications. Evaluated on execution-intensive benchmarks, the method achieves consistent performance gains, substantially reduces regression errors, and produces a reproducible, auditable single-version evolutionary trajectory.

📝 Abstract
Recent progress in large language model (LLM) agents has largely focused on embedding self-improvement mechanisms inside the agent or searching over many concurrent variants. While these approaches can raise aggregate scores, they often yield unstable and hard-to-audit improvement trajectories, making it difficult to guarantee non-regression or to reason about failures across versions. We reframe agent improvement as **release engineering**: agents are treated as shippable artifacts, and improvement is externalized into a regression-aware release pipeline. We introduce **AgentDevel**, a release engineering pipeline that iteratively runs the current agent, produces implementation-blind, symptom-level quality signals from execution traces, synthesizes a single release candidate (RC) via executable diagnosis, and promotes it under flip-centered gating. AgentDevel features three core designs: (i) an implementation-blind LLM critic that characterizes failure appearances without accessing agent internals, (ii) script-based executable diagnosis that aggregates dominant symptom patterns and produces auditable engineering specifications, and (iii) flip-centered gating that prioritizes pass-to-fail regressions and fail-to-pass fixes as first-class evidence. Unlike population-based search or in-agent self-refinement, AgentDevel maintains a single canonical version line and emphasizes non-regression as a primary objective. Experiments on execution-heavy benchmarks demonstrate that AgentDevel yields stable improvements with significantly fewer regressions while producing reproducible, auditable artifacts. Overall, AgentDevel provides a practical development discipline for building, debugging, and releasing LLM agents as software.
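The flip-centered gating idea can be illustrated with a minimal sketch: compare per-task pass/fail outcomes of the current release and a release candidate, and treat pass-to-fail flips (regressions) and fail-to-pass flips (fixes) as the primary evidence for promotion. The function name, result structure, and strict no-regression rule below are illustrative assumptions, not the paper's actual interface or policy.

```python
# Hypothetical sketch of flip-centered gating as described in the abstract.
# Names and the strict promotion rule are illustrative assumptions.

def flip_centered_gate(baseline: dict, candidate: dict) -> dict:
    """Decide whether to promote a release candidate (RC).

    baseline, candidate: map task id -> bool (True = pass).
    Pass-to-fail and fail-to-pass flips are first-class evidence;
    aggregate score is secondary.
    """
    # Tasks that passed before but fail under the candidate: regressions.
    regressions = [t for t in baseline
                   if baseline[t] and not candidate.get(t, False)]
    # Tasks that failed before but pass under the candidate: fixes.
    fixes = [t for t in baseline
             if not baseline[t] and candidate.get(t, False)]
    # Non-regression as a primary objective (strict variant): reject any RC
    # that breaks a previously passing task, however many tasks it fixes.
    promote = not regressions and len(fixes) > 0
    return {"promote": promote, "regressions": regressions, "fixes": fixes}

baseline = {"t1": True, "t2": False, "t3": True}
candidate = {"t1": True, "t2": True, "t3": True}
print(flip_centered_gate(baseline, candidate)["promote"])  # True: one fix, no regressions
```

Keeping the gate flip-centered rather than score-centered is what yields a single canonical version line: a candidate with a higher aggregate score but even one pass-to-fail flip is rejected.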
Problem

Research questions and friction points this paper is trying to address.

LLM agents
self-evolution
non-regression
release engineering
auditability
Innovation

Methods, ideas, or system contributions that make the work stand out.

release engineering
implementation-blind critic
executable diagnosis
flip-centered gating
non-regression