🤖 AI Summary
This work addresses the limitations of large language model agents when confronted with ambiguous programming instructions, a challenge often stemming from their inability to recognize missing contextual information. To overcome this, the authors propose an uncertainty-aware multi-agent framework that decouples ambiguity detection from code execution, enabling agents to autonomously identify uncertainties and proactively solicit necessary clarifications. Implemented on the OpenHands platform using the Claude Sonnet 4.5 model, this approach represents the first systematic realization of well-calibrated clarification-seeking behavior in coding agents. Evaluated on a fuzzy-instruction variant of SWE-bench Verified, the method achieves a task success rate of 69.40%, substantially outperforming a single-agent baseline (61.20%) and approaching the performance observed under fully specified instructions. In doing so, it advances agents from passive executors toward proactive collaborators.
📝 Abstract
As Large Language Model (LLM) agents are increasingly deployed in open-ended domains like software engineering, they frequently encounter underspecified instructions that lack crucial context. While human developers naturally resolve underspecification by asking clarifying questions, current agents are largely optimized for autonomous execution. In this work, we systematically evaluate the clarification-seeking abilities of LLM agents on an underspecified variant of SWE-bench Verified. We propose an uncertainty-aware multi-agent scaffold that explicitly decouples underspecification detection from code execution. Our results demonstrate that this multi-agent system, using OpenHands + Claude Sonnet 4.5, achieves a 69.40% task resolve rate, significantly outperforming a standard single-agent setup (61.20%) and closing the performance gap with agents operating on fully specified instructions. Furthermore, we find that the multi-agent system exhibits well-calibrated uncertainty, conserving queries on simple tasks while proactively seeking information on more complex issues. These findings indicate that current models can be turned into proactive collaborators that independently recognize when to ask questions and elicit missing information in real-world, underspecified tasks.