Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses

📅 2026-04-28
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses key challenges in harness engineering for automated coding agents: heterogeneous action spaces, sparse and noisy evaluation signals, lengthy execution trajectories, and difficulty in attributing the effects of edits. The paper introduces the first observability-driven framework for autonomous harness evolution, leveraging component-, experience-, and decision-level observability to transform each edit into a verifiable contract, thereby enabling traceable, distillable, and verifiable self-optimization. Core technical contributions include component-level file representations, hierarchical compression of trajectory evidence, and edit-prediction alignment verification, which collectively support reversible editing and autonomous decision-making. Over ten evolution rounds, the method improves pass@1 on Terminal-Bench 2 from 69.7% to 77.0%, surpassing both human-designed and self-evolution baselines; the frozen harness further demonstrates strong generalization across multiple model families and on SWE-bench.
πŸ“ Abstract
Harnesses have become a central determinant of coding-agent performance, shaping how models interact with repositories, tools, and execution environments. Yet automating harness engineering is hard: a heterogeneous action space, sparse and noisy evaluation signal, multi-million-token trajectories, and edits whose effect is hard to attribute to the next round's outcomes. We introduce Agentic Harness Engineering (AHE), a framework that automates harness-level evolution by instrumenting the three stages of any engineering loop (component editing, trajectory inspection, and decision making) with matched observability pillars: (1) component observability gives every editable harness component a file-level representation so the action space is explicit and revertible; (2) experience observability distills millions of raw trajectory tokens into a layered, drill-down evidence corpus that an evolving agent can actually consume; and (3) decision observability pairs every edit with a self-declared prediction, later verified against the next round's task-level outcomes. Together, these pillars turn every edit into a falsifiable contract, so harness evolution proceeds autonomously without collapsing into trial-and-error. Empirically, ten AHE iterations lift pass@1 on Terminal-Bench 2 from 69.7% to 77.0%, surpassing the human-designed harness Codex-CLI (71.9%) and the self-evolving baselines ACE and TF-GRPO. The frozen harness transfers without re-evolution: on SWE-bench-verified it tops aggregate success at 12% fewer tokens than the seed, and on Terminal-Bench 2 it yields +5.1 to +10.1pp cross-family gains across three alternate model families, indicating the evolved components encode general engineering experience rather than benchmark-specific tuning. These results position observability-driven evolution as a practical pathway to keep coding-agent harnesses continually improving.
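The "falsifiable contract" idea in the abstract can be illustrated with a minimal Python sketch. This is not the paper's implementation: the names `EditContract` and `verify`, the field layout, the example file path, and the keep-or-revert rule are all assumptions made for illustration. It only shows the general shape of pairing a file-level edit with a self-declared prediction and checking that prediction against the next round's task-level outcomes.

```python
from dataclasses import dataclass, field


@dataclass
class EditContract:
    """Hypothetical record: one harness edit plus its declared prediction."""
    component_path: str          # file-level representation of the edited component
    old_text: str                # retained so the edit can be reverted
    new_text: str
    predicted_to_pass: set[str]  # task IDs the edit is predicted to fix
    predicted_to_stay: set[str] = field(default_factory=set)  # tasks that must not regress


def verify(contract: EditContract,
           before: dict[str, bool],
           after: dict[str, bool]) -> dict:
    """Check the prediction against task-level outcomes (task ID -> passed)."""
    # Tasks that were failing before the edit and pass after it.
    fixed = {t for t in contract.predicted_to_pass
             if not before.get(t, False) and after.get(t, False)}
    # Guarded tasks that were passing before and fail after.
    regressed = {t for t in contract.predicted_to_stay
                 if before.get(t, False) and not after.get(t, False)}
    missed = contract.predicted_to_pass - fixed
    # Assumed decision rule (not from the paper): keep the edit only if it
    # fixed at least one predicted task and broke none of the guarded ones.
    keep = bool(fixed) and not regressed
    return {"fixed": fixed, "missed": missed, "regressed": regressed, "keep": keep}


# Hypothetical usage with made-up task IDs and outcomes.
contract = EditContract(
    component_path="prompts/system.md",
    old_text="Run the tests.",
    new_text="Run the tests and read the failure output before editing.",
    predicted_to_pass={"task-a", "task-b"},
    predicted_to_stay={"task-c"},
)
before = {"task-a": False, "task-b": False, "task-c": True}
after = {"task-a": True, "task-b": False, "task-c": True}
result = verify(contract, before, after)
```

Because the prediction is recorded before the next round runs, a wrong prediction is visible as a non-empty `missed` or `regressed` set, and the stored `old_text` is what would make reverting the edit possible in this sketch.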
Problem

Research questions and friction points this paper is trying to address.

harness engineering
coding agent
observability
automatic evolution
agent performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Agentic Harness Engineering
observability-driven evolution
coding agents
automated harness optimization
trajectory distillation