LHAW: Controllable Underspecification for Long-Horizon Tasks

πŸ“… 2026-02-11
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the lack of a scalable, task-agnostic framework for systematically studying how ambiguity affects agent behavior in long-horizon workflows. The authors propose LHAW, a modular, dataset-agnostic synthetic pipeline that controllably removes information along four dimensions (goals, constraints, inputs, and context) to transform well-defined tasks into variants with configurable degrees of ambiguity. Rather than relying on large language model predictions of ambiguity, LHAW validates its effects empirically through agent trials: variants are categorized by observed execution outcomes, enabling the first systematic, configurable generation and cost-sensitive evaluation of ambiguity in long-horizon tasks. The study releases 285 ambiguous task variants derived from TheAgentCompany, SWE-Bench Pro, and MCP-Atlas, establishing the first benchmark framework for evaluating clarification behaviors and revealing critical bottlenecks in current agents' ability to detect and handle ambiguity.

πŸ“ Abstract
Long-horizon workflow agents that operate effectively over extended periods are essential for truly autonomous systems. Their reliable execution critically depends on the ability to reason through ambiguous situations in which clarification seeking is necessary to ensure correct task execution. However, progress is limited by the lack of scalable, task-agnostic frameworks for systematically curating and measuring the impact of ambiguity across custom workflows. We address this gap by introducing LHAW (Long-Horizon Augmented Workflows), a modular, dataset-agnostic synthetic pipeline that transforms any well-specified task into controllable underspecified variants by systematically removing information across four dimensions (Goals, Constraints, Inputs, and Context) at configurable severity levels. Unlike approaches that rely on LLM predictions of ambiguity, LHAW validates variants through empirical agent trials, classifying them as outcome-critical, divergent, or benign based on observed terminal state divergence. We release 285 task variants from TheAgentCompany, SWE-Bench Pro, and MCP-Atlas according to our taxonomy, alongside formal analysis measuring how current agents detect, reason about, and resolve underspecification across ambiguous settings. LHAW provides the first systematic framework for cost-sensitive evaluation of agent clarification behavior in long-horizon settings, enabling development of reliable autonomous systems.
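The two mechanisms the abstract names, removing information along a dimension at a severity level and classifying variants by terminal-state divergence across agent trials, can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the paper's implementation: the `Task`, `underspecify`, and `classify` names, the sentence-truncation heuristic, and the exact divergence rules are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum

# The four removal dimensions named in the abstract.
DIMENSIONS = ("goals", "constraints", "inputs", "context")

class AmbiguityClass(Enum):
    OUTCOME_CRITICAL = "outcome-critical"  # no trial reaches the reference terminal state
    DIVERGENT = "divergent"                # trials reach differing terminal states
    BENIGN = "benign"                      # all trials match the reference terminal state

@dataclass
class Task:
    spec: dict  # maps each dimension name to its specification text

def underspecify(task: Task, dimension: str, severity: float) -> Task:
    """Drop a `severity` fraction of the sentences in one dimension of the spec.

    A stand-in for the paper's controllable removal; deterministic truncation
    is used here for simplicity.
    """
    sentences = task.spec[dimension].split(". ")
    keep = max(1, round(len(sentences) * (1 - severity)))
    new_spec = dict(task.spec)
    new_spec[dimension] = ". ".join(sentences[:keep])
    return Task(spec=new_spec)

def classify(reference_state: str, trial_states: list[str]) -> AmbiguityClass:
    """Assign a variant to the taxonomy by comparing agent trial outcomes
    against the terminal state of the fully specified task."""
    if all(s == reference_state for s in trial_states):
        return AmbiguityClass.BENIGN
    if any(s == reference_state for s in trial_states):
        return AmbiguityClass.DIVERGENT
    return AmbiguityClass.OUTCOME_CRITICAL
```

The key design point this sketch captures is that the label is decided by observed execution outcomes, not by asking an LLM whether the variant looks ambiguous.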
Problem

Research questions and friction points this paper is trying to address.

long-horizon tasks
underspecification
workflow agents
ambiguity
clarification
Innovation

Methods, ideas, or system contributions that make the work stand out.

underspecification
long-horizon agents
workflow augmentation
ambiguity evaluation
clarification behavior