Harnesses for Inference-Time Alignment over Execution Trajectories

📅 2026-05-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses how to design inference-time harnesses that improve the long-term execution success of large language model agents on complex tasks. Recognizing that excessive decomposition or guidance can degrade performance, the work formalizes harness design as a trajectory alignment problem and separates it into two mechanisms: task decomposition and guided execution. It systematically investigates the impact of workflow granularity, retry budgets, and action reweighting. The authors propose a “partial harness” strategy, which specifies only the initial steps and leaves the rest of the execution to the agent, steering execution while avoiding failure modes such as over-decomposition, over-pruning, and hallucinated execution. Validation in both controlled synthetic environments and real terminal agent benchmarks shows that this approach can achieve a higher pass rate than fully structured workflows.

📝 Abstract

Harness engineering has emerged as an important inference-time technique for large language model (LLM) agents, aiming to improve long-term performance through task decomposition and guided execution. However, more elaborate harnesses are not uniformly better: increasing decomposition or guidance can sometimes improve execution, but can also reduce final task success. We study harness design through the lens of inference-time trajectory alignment. This perspective separates a harness into two mechanisms: task decomposition, which structures a task into sub-goals, and guided execution, which reshapes local action distributions during execution. This decomposition allows us to quantify how workflow granularity, retry budgets, and guidance-induced action reweighting shape the performance limits of harness design. It further reveals concrete failure modes, including over-decomposition, over-pruning, and hallucinated execution. We validate these predictions through controlled synthetic experiments and real terminal agent benchmarks. Inspired by the theory, we further show that effective harnesses can be partial: specifying only the initial steps and leaving the remaining execution to the agent can achieve a higher pass rate than fully structured workflows.
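
To make the interplay between workflow granularity and retry budgets concrete, here is a minimal toy sketch in Python. It is not the paper's formalization: it assumes that sub-goals are attempted independently, that each attempt succeeds with a fixed probability, and that a workflow succeeds only if every sub-goal does. The function names and all probabilities are hypothetical, chosen only for illustration.

```python
def subgoal_success(p_attempt: float, retries: int) -> float:
    """Probability that at least one of `retries` independent attempts succeeds."""
    return 1.0 - (1.0 - p_attempt) ** retries


def workflow_success(p_attempts: list[float], retries: int) -> float:
    """Probability that every sub-goal of a sequential workflow succeeds.

    Assumption (not from the paper): sub-goals are independent, and each
    one gets the same retry budget.
    """
    total = 1.0
    for p in p_attempts:
        total *= subgoal_success(p, retries)
    return total


# Hypothetical numbers: a coarse workflow with two harder sub-goals versus
# a fine-grained workflow with six easier sub-goals.
coarse_steps = [0.7, 0.7]
fine_steps = [0.8] * 6

for retries in (2, 4):
    coarse = workflow_success(coarse_steps, retries)
    fine = workflow_success(fine_steps, retries)
    print(f"retries={retries}: coarse={coarse:.4f}, fine={fine:.4f}")
```

Under these made-up numbers, the six-step workflow scores lower than the two-step one with a retry budget of two, because per-step failures compound across more steps, but it scores higher with a budget of four. This only illustrates why finer decomposition is not uniformly better in such a model; the paper's own analysis may rely on different assumptions.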

Problem

Research questions and friction points this paper is trying to address.

inference-time alignment

harness design

execution trajectories

task decomposition

guided execution

Innovation

Methods, ideas, or system contributions that make the work stand out.

inference-time alignment

task decomposition

guided execution