🤖 AI Summary
This work addresses the difficulty large language models face in balancing literal interpretation against contextual reasoning during interactive instruction following, where they often fall back on suboptimal clarification or inference strategies when given ambiguous directives. To investigate this, the authors propose Build What I Mean (BWIM), an interactive benchmark grounded in a psycholinguistic two-speaker paradigm that contrasts a pragmatically cooperative speaker with one who is only literally reliable. By combining explicit confidence ratings with behavioral decision measures, BWIM shows for the first time that models can detect speaker unreliability yet fail to adapt their clarification behavior accordingly. The study also introduces an evaluation framework that quantifies communicative cost, exposing a pervasive tendency among mainstream models either to over-clarify or to avoid questioning altogether, which points to fundamental limitations in efficient contextual reasoning.
📝 Abstract
We investigate the separation of literal interpretation from contextual inference in a collaborative block-building task where a builder must resolve underspecified instructions by reasoning about context. Building on an existing two-speaker psycholinguistic paradigm, which contrasts a pragmatically cooperative speaker with one who is only literally reliable, we introduce Build What I Mean (BWIM), an interactive benchmark for contextual meaning construction. In BWIM, models must resolve ambiguity either by performing a contextual inference or by requesting clarification at a small communication cost. Evaluating several state-of-the-art LLMs, we find a dissociation between judgment and action: while models detect speaker unreliability in explicit confidence ratings, they fail to exploit this information to guide efficient clarification behavior. Instead, we observe suboptimal strategies such as partner-blind over-clarification and question-averse guessing under uncertainty.
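To make the clarify-or-infer tradeoff concrete, here is a minimal sketch of the kind of cost-sensitive decision rule BWIM evaluates; it is not the paper's implementation, and the function names, cost values, and reliability discount are illustrative assumptions. The idea is that a builder should ask a clarification question only when the confidence-implied expected cost of guessing exceeds the fixed cost of asking, and a partner-aware builder should trust its interpretation less when the speaker is only literally reliable.

```python
# Hypothetical sketch of the clarify-vs-guess tradeoff BWIM measures.
# All names, costs, and the reliability discount are assumptions for
# illustration, not the paper's actual parameters or API.

QUESTION_COST = 1.0   # fixed communicative cost of asking for clarification
ERROR_COST = 5.0      # cost of acting on a wrong interpretation

def expected_guess_cost(confidence: float) -> float:
    """Expected cost of committing to the most likely interpretation."""
    return (1.0 - confidence) * ERROR_COST

def should_clarify(confidence: float, speaker_reliable: bool) -> bool:
    """Ask only when guessing is expected to cost more than a question.

    A partner-aware builder lowers its effective confidence when the
    speaker is merely literally reliable, since no pragmatic cooperation
    can be assumed when resolving ambiguity.
    """
    effective = confidence if speaker_reliable else confidence * 0.7  # assumed discount
    return expected_guess_cost(effective) > QUESTION_COST

# Same raw confidence, different partners: an efficient builder clarifies
# only with the unreliable speaker.
print(should_clarify(confidence=0.85, speaker_reliable=False))  # True
print(should_clarify(confidence=0.85, speaker_reliable=True))   # False
```

Under this rule, the failure modes the abstract reports correspond to degenerate policies: partner-blind over-clarification ignores `speaker_reliable` and always asks, while question-averse guessing behaves as if `QUESTION_COST` were prohibitively high.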