AI Summary
This work addresses key challenges in harness engineering for automated coding agents, namely heterogeneous action spaces, sparse and noisy evaluation signals, lengthy execution trajectories, and difficulty in attributing the effects of edits. The paper introduces the first observability-driven framework for autonomous harness evolution, leveraging component-, experience-, and decision-level observability to transform each edit into a verifiable contract, thereby enabling traceable, distillable, and verifiable self-optimization. Core technical contributions include component-level file representations, hierarchical compression of trajectory evidence, and edit-prediction alignment verification, which collectively support reversible editing and autonomous decision-making. Over ten evolution rounds, the method improves pass@1 on Terminal-Bench 2 from 69.7% to 77.0%, surpassing both human-designed and self-evolution baselines; the frozen harness further demonstrates strong generalization across multiple model families and on SWE-bench.
Abstract
Harnesses have become a central determinant of coding-agent performance, shaping how models interact with repositories, tools, and execution environments. Yet automating harness engineering is hard: a heterogeneous action space, sparse and noisy evaluation signal, multi-million-token trajectories, and edits whose effect is hard to attribute to the next round's outcomes. We introduce Agentic Harness Engineering (AHE), a framework that automates harness-level evolution by instrumenting the three stages of any engineering loop (component editing, trajectory inspection, and decision making) with matched observability pillars: (1) component observability gives every editable harness component a file-level representation so the action space is explicit and revertible; (2) experience observability distills millions of raw trajectory tokens into a layered, drill-down evidence corpus that an evolving agent can actually consume; and (3) decision observability pairs every edit with a self-declared prediction, later verified against the next round's task-level outcomes. Together, these pillars turn every edit into a falsifiable contract, so harness evolution proceeds autonomously without collapsing into trial-and-error. Empirically, ten AHE iterations lift pass@1 on Terminal-Bench 2 from 69.7% to 77.0%, surpassing the human-designed harness Codex-CLI (71.9%) and the self-evolving baselines ACE and TF-GRPO. The frozen harness transfers without re-evolution: on SWE-bench-verified it tops aggregate success at 12% fewer tokens than the seed, and on Terminal-Bench 2 it yields +5.1 to +10.1pp cross-family gains across three alternate model families, indicating the evolved components encode general engineering experience rather than benchmark-specific tuning. These results position observability-driven evolution as a practical pathway to keep coding-agent harnesses continually improving.
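The "falsifiable contract" idea behind decision observability can be illustrated with a minimal sketch. This is not the AHE implementation: the `EditContract` schema, the `verify` function, the `min_hit_rate` threshold, and the keep-or-revert rule are all hypothetical, chosen only to show how an edit's self-declared prediction might be checked against the next round's task-level outcomes.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class EditContract:
    """A harness edit paired with a self-declared prediction (hypothetical schema)."""
    component_path: str        # file-level representation of the edited component
    rationale: str             # why the evolving agent made the edit
    predicted_fixes: set[str]  # task IDs expected to flip from fail to pass
    predicted_regressions: set[str] = field(default_factory=set)


def verify(contract: EditContract,
           before: dict[str, bool],
           after: dict[str, bool],
           min_hit_rate: float = 0.5) -> dict:
    """Check the prediction against the next round's task-level outcomes.

    `before` and `after` map task ID -> True (pass) or False (fail).
    The 0.5 threshold is an arbitrary illustrative choice.
    """
    fixed = {t for t in after if after[t] and not before.get(t, False)}
    broken = {t for t in after if not after[t] and before.get(t, False)}
    hits = contract.predicted_fixes & fixed
    unexpected = broken - contract.predicted_regressions
    hit_rate = (len(hits) / len(contract.predicted_fixes)
                if contract.predicted_fixes else 0.0)
    # Keep the edit only if enough predicted fixes materialized and
    # nothing regressed that the contract did not anticipate.
    keep = hit_rate >= min_hit_rate and not unexpected
    return {"hit_rate": hit_rate,
            "unexpected_regressions": unexpected,
            "keep": keep}


contract = EditContract(
    component_path="prompts/system.md",  # hypothetical path
    rationale="Remind the agent to run the test suite before finishing",
    predicted_fixes={"task-1", "task-2"},
)
before = {"task-1": False, "task-2": False, "task-3": True}
after = {"task-1": True, "task-2": False, "task-3": True}
result = verify(contract, before, after)
print(result["hit_rate"], result["keep"])
```

Because each edit lives in its own file-level component, a contract that fails verification could in principle be reverted in isolation; the sketch records only the keep-or-revert decision and leaves the revert mechanism out.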