🤖 AI Summary
This work addresses "agent-environment misalignment" in the self-evolution of large language model (LLM) agents: the agent's capability frontier shifts during training, while the learning environment provides static or only weakly coupled supervisory signals. To resolve this, the authors propose SEAL, a closed-loop co-evolution framework for interactive tool-use agents that couples the agent policy and the learning environment. SEAL collects on-policy execution trajectories under executable verification, diagnoses failed rollouts into turn-level failure labels, and uses these diagnoses as a shared signal to jointly optimize both sides: the environment, by exposing clearer tool affordance cues, constraint information, and recovery-oriented feedback, and the agent policy, via diagnosis-guided advantage reweighting. With only 400 training samples, SEAL achieves average gains of 8.25 to 26.25 points across three backbone models and shows positive out-of-distribution transfer, improving self-evolution under low-resource conditions.
📝 Abstract
Large Language Model (LLM) agents are increasingly improved through interaction, yet most self-evolution methods adapt either the policy or the learning environment in isolation. We identify this structural gap as Agent-Environment Misalignment: the agent's capability frontier changes during training, while the environment that provides supervision remains static or only weakly coupled to the agent's revealed failures. We propose SEAL, a closed-loop co-evolution framework for interactive tool-use agents. SEAL collects on-policy trajectories under executable verification, diagnoses failed rollouts into turn-level failure labels, and uses these diagnoses as a shared signal for both environment-side adaptation and model-side policy optimization. The environment evolves its training-time learning interface by exposing clearer tool affordance cues, constraint information, and recovery-oriented feedback, while the policy is updated with diagnosis-guided advantage reweighting. Extensive experiments across in-distribution and out-of-distribution multi-turn tool-use evaluations show that SEAL improves low-resource agent learning: with only 400 training samples, it yields +8.25 to +26.25 average-point gains across three backbones and exhibits positive out-of-distribution transfer. These results demonstrate the value of jointly adapting the learner and its training-time learning substrate for robust self-improving LLM agents.
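The abstract does not spell out how diagnosis-guided advantage reweighting is computed, so the following is only a minimal sketch of one plausible reading: each turn of a rollout carries an advantage estimate and a diagnosis label, and the advantage is scaled by a per-label weight. The label names, the weight values, and the names `Turn`, `LABEL_WEIGHTS`, and `reweight_advantages` are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical label-to-weight map. The abstract names neither the
# failure labels nor the weights; these values are illustrative only.
LABEL_WEIGHTS: Dict[str, float] = {
    "ok": 1.0,                    # no diagnosed failure on this turn
    "wrong_tool": 2.0,            # assumed label: emphasize tool-selection errors
    "constraint_violation": 1.5,  # assumed label: emphasize constraint errors
}


@dataclass
class Turn:
    advantage: float  # per-turn advantage estimate from the policy optimizer
    label: str        # turn-level diagnosis label for this turn


def reweight_advantages(
    turns: List[Turn],
    weights: Dict[str, float] = LABEL_WEIGHTS,
) -> List[float]:
    """Scale each turn's advantage by the weight of its diagnosis label.

    Labels missing from the map fall back to a neutral weight of 1.0.
    """
    return [t.advantage * weights.get(t.label, 1.0) for t in turns]


# A toy three-turn rollout with one clean turn and two diagnosed failures.
rollout = [
    Turn(advantage=0.5, label="ok"),
    Turn(advantage=-1.0, label="wrong_tool"),
    Turn(advantage=-0.5, label="constraint_violation"),
]
reweighted = reweight_advantages(rollout)
print(reweighted)
```

Under these assumed weights, turns diagnosed as failures contribute a larger-magnitude signal to the policy update than undiagnosed turns, which is one way a shared diagnosis signal could steer optimization toward the agent's revealed failures. The actual weighting scheme in SEAL may differ.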