🤖 AI Summary
Large language model agents struggle to selectively enhance the specific capabilities required for particular tasks. This work proposes the first end-to-end self-improvement framework that automatically identifies capability gaps by contrasting successful and failed execution trajectories, synthesizes capability-oriented training environments from these insights, and trains lightweight LoRA adapters via reinforcement learning. At inference, the system dynamically routes each task to the most relevant adapter. The method improves over the base agent by 14.1 points on τ²-bench and earns seven additional perfect scores on ToolSandbox, substantially outperforming the strongest baselines while training more efficiently under the same rollout budget.
📝 Abstract
Large Language Models (LLMs) deployed in agentic environments must exercise multiple capabilities across different task instances, where a capability is the ability to perform one or more actions in a trajectory that are necessary for successfully solving a subset of tasks in the environment. Many existing approaches either rely on synthetic training data that is not targeted to the model's actual capability deficits in the target environment, or train directly on the target environment, where the model must implicitly learn the capabilities across tasks. We introduce TRACE (Turning Recurrent Agent failures into Capability-targeted training Environments), an end-to-end system for environment-specific agent self-improvement. TRACE contrasts successful and failed trajectories to automatically identify lacking capabilities, synthesizes a targeted training environment for each capability that rewards whether it was exercised, and trains a LoRA adapter via RL on each synthetic environment, routing to the relevant adapter at inference. Empirically, TRACE generalizes across environments, improving over the base agent by +14.1 points on $τ^2$-bench (customer service) and +7 perfect scores on ToolSandbox (tool use), and outperforming the strongest baseline by +7.4 points and +4 perfect scores, respectively. Given the same number of rollouts, TRACE scales more efficiently than baselines, outperforming GRPO and GEPA by +9.2 and +7.4 points on $τ^2$-bench.