🤖 AI Summary
This study addresses the challenge of designing effective inference-time harnesses to improve the long-term execution success of large language model (LLM) agents on complex tasks. Observing that excessive decomposition or guidance can degrade performance, the work formalizes harness design as a trajectory alignment problem and separates it into two mechanisms: task decomposition and guided execution. It systematically investigates the impact of workflow granularity, retry budgets, and guidance-induced action reweighting. The authors propose a “partial harness” strategy, which specifies only the initial steps and leaves the rest of the execution to the agent; this steers execution while avoiding failure modes such as over-decomposition, over-pruning, and hallucinated execution. Validation in both controlled synthetic experiments and real terminal agent benchmarks shows that this approach can achieve a higher pass rate than fully structured workflows.
📝 Abstract
Harness engineering has emerged as an important inference-time technique for large language model (LLM) agents, aiming to improve long-term performance through task decomposition and guided execution. However, more elaborate harnesses are not uniformly better: increasing decomposition or guidance can sometimes improve execution, but it can also reduce final task success. We study harness design through the lens of inference-time trajectory alignment. This perspective separates a harness into two mechanisms: task decomposition, which structures a task into sub-goals, and guided execution, which reshapes local action distributions during execution. This separation allows us to quantify how workflow granularity, retry budgets, and guidance-induced action reweighting shape the performance limits of harness design. It further reveals concrete failure modes, including over-decomposition, over-pruning, and hallucinated execution. We validate these predictions through controlled synthetic experiments and real terminal agent benchmarks. Inspired by the theory, we further show that effective harnesses can be partial: specifying only the initial steps and leaving the remaining execution to the agent can achieve a higher pass rate than fully structured workflows.
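To make the interaction between workflow granularity and retry budgets concrete, the following toy calculation may help. It is an illustrative sketch, not the paper's formal model: it assumes that sub-goals succeed independently, that every attempt at a sub-goal has the same success probability, and that a workflow succeeds only if all of its sub-goals succeed. The function names and the numbers used are hypothetical.

```python
from typing import Sequence


def step_success(p: float, retries: int) -> float:
    """Probability that one sub-goal succeeds within `retries` attempts.

    Assumption: attempts are independent and each succeeds with probability p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if retries < 1:
        raise ValueError("retries must be at least 1")
    return 1.0 - (1.0 - p) ** retries


def workflow_success(step_probs: Sequence[float], retries: int) -> float:
    """Probability that every sub-goal in a workflow succeeds.

    Assumption: sub-goals are independent, so the probabilities multiply.
    """
    total = 1.0
    for p in step_probs:
        total *= step_success(p, retries)
    return total


if __name__ == "__main__":
    # Hypothetical per-attempt success rate of 0.9 for every sub-goal.
    for n_steps in (3, 10):
        for retries in (1, 2):
            rate = workflow_success([0.9] * n_steps, retries)
            print(f"steps={n_steps} retries={retries} success={rate:.3f}")
```

Under these assumptions, adding sub-goals without raising the per-step success rate lowers the end-to-end success rate multiplicatively, while a larger retry budget offsets part of that loss. This is one simple way to see why a finer-grained workflow is not automatically better.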