AI Summary
This work addresses the lack of a scalable, task-agnostic framework for systematically studying how ambiguity affects agent behavior in long-horizon workflows. The authors propose LHAW, a modular, dataset-agnostic synthetic pipeline that controllably removes information along four dimensions (goals, constraints, inputs, and context) to transform well-defined tasks into variants with varying degrees of ambiguity. The framework categorizes ambiguity types based on agent execution outcomes, enabling the first systematic, configurable generation and cost-sensitive evaluation of ambiguity in long-horizon tasks. Rather than relying on large language model predictions of ambiguity, LHAW validates ambiguity effects empirically through agent trials. The study releases 285 ambiguous task variants derived from TheAgentCompany, SWE-Bench Pro, and MCP-Atlas, establishing the first benchmark framework for evaluating clarification behaviors and revealing critical bottlenecks in current agents' ability to detect and handle ambiguity.
Abstract
Long-horizon workflow agents that operate effectively over extended periods are essential for truly autonomous systems. Their reliable execution depends critically on the ability to reason through ambiguous situations in which seeking clarification is necessary for correct task execution. However, progress is limited by the lack of scalable, task-agnostic frameworks for systematically curating ambiguity and measuring its impact across custom workflows. We address this gap by introducing LHAW (Long-Horizon Augmented Workflows), a modular, dataset-agnostic synthetic pipeline that transforms any well-specified task into controllably underspecified variants by systematically removing information across four dimensions (Goals, Constraints, Inputs, and Context) at configurable severity levels. Unlike approaches that rely on LLM predictions of ambiguity, LHAW validates variants through empirical agent trials, classifying them as outcome-critical, divergent, or benign based on observed terminal-state divergence. We release 285 task variants from TheAgentCompany, SWE-Bench Pro, and MCP-Atlas, organized according to our taxonomy, alongside a formal analysis of how current agents detect, reason about, and resolve underspecification in ambiguous settings. LHAW provides the first systematic framework for cost-sensitive evaluation of agent clarification behavior in long-horizon settings, enabling the development of reliable autonomous systems.
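To make the pipeline's two stages concrete, here is a minimal sketch of how variant generation and outcome-based classification could fit together. This is an illustrative assumption, not the authors' released code: all names (`Dimension`, `TaskVariant`, `classify_variant`, `run_trial`) are hypothetical, and the terminal-state comparison is simplified to equality of opaque state identifiers.

```python
# Hypothetical sketch of LHAW-style variant generation and empirical
# classification. Names and signatures are illustrative assumptions,
# not the paper's actual API.
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Dimension(Enum):
    """The four information dimensions that LHAW ablates."""
    GOALS = "goals"
    CONSTRAINTS = "constraints"
    INPUTS = "inputs"
    CONTEXT = "context"


@dataclass
class TaskVariant:
    task_id: str
    spec: str              # task description with information removed
    dimension: Dimension   # which dimension was ablated
    severity: float        # configurable severity level in [0, 1]


def classify_variant(
    variant: TaskVariant,
    run_trial: Callable[[TaskVariant], str],  # returns a terminal-state id
    baseline_state: str,                      # terminal state on the full spec
    n_trials: int = 5,
) -> str:
    """Label a variant by observed terminal-state divergence across trials."""
    states = [run_trial(variant) for _ in range(n_trials)]
    if all(s != baseline_state for s in states):
        return "outcome-critical"  # ambiguity reliably changes the outcome
    if any(s != baseline_state for s in states):
        return "divergent"         # outcome varies from run to run
    return "benign"                # removal had no observable effect
```

Classifying from repeated agent trials rather than from a single LLM prediction mirrors the paper's emphasis on empirical validation: a variant's label reflects what agents actually did, not what a model guessed would happen.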