🤖 AI Summary
This work addresses the challenges of sparse rewards in long-horizon stochastic environments, where external execution structures (harnesses) for agents suffer from sparse feedback, high variance, and attribution difficulties during evolution. The paper introduces DemoEvolve, the first approach to incorporate human demonstration trajectories into harness evolution. By leveraging demonstration-guided code proposals, harness-level editing policies, and self-rollout-based evolutionary search, DemoEvolve achieves efficient task adaptation without updating the underlying language model weights. The method substantially improves diagnosability, localizability, and stability under sparse feedback, generating more effective and auditable harness modifications within the same computational budget. Experiments on tasks such as Balatro demonstrate that DemoEvolve outperforms purely self-evolved or text-knowledge-only baselines in both performance and reliability.
📝 Abstract
Agent harness evolution improves frozen language-model agents by modifying the executable structures around them. We study this paradigm as a form of sample-efficient fast adaptation: instead of updating model weights, an agent can acquire task-specific competence by changing its external harness, while leaving the base model's general capabilities intact. Prior work shows that self-generated rollouts can support harness search, suggesting that agents may acquire new task competence through practice. Yet in long-horizon stochastic environments, self-practice becomes fragile: rewards are sparse, outcomes are high-variance, and failures are hard to attribute to concrete harness mechanisms. We introduce DemoEvolve, a demonstration-bootstrapped approach to harness evolution. When reward-only search is too broad and noisy, competent human trajectories serve as expert reference experience for the coding proposer, guiding harness-level diagnosis and editing. Experiments on Liar's Dice show that self-rollout evolution can work when episodes are short and failures are attributable. In contrast, Balatro exposes a harder long-horizon stochastic regime, where self-rollout evolution is misled by sparse feedback and candidate-selection noise, while tutorial-like textual knowledge alone does not yield stable improvement. Under the same limited budget, DemoEvolve produces more effective and auditable harness edits and achieves better performance. Overall, demonstrations make sparse-feedback harness evolution more diagnosable, localizable, and stable.