The Yes-Man Syndrome: Benchmarking Abstention in Embodied Robotic Agents

📅 2026-05-19

🤖 AI Summary
This work addresses the inability of current vision-language models (VLMs) to abstain from responding to ambiguous, infeasible, or premise-faulty instructions in embodied robotic settings, and the lack of systematic evaluation of their capacity to withhold responses under perceptual grounding and physical constraints. The authors propose RoboAbstention, a framework that combines structured visual grounding, deterministic constraint derivation, and templated instruction generation to construct an auditable and scalable dataset of abstention-oriented instructions for embodied agents, along with a corresponding taxonomy and benchmark. Experiments reveal that mainstream VLMs exhibit low abstention rates: the best-performing model, Gemini 2.5 Flash, abstains on only 39.0% of instructions, and Gemini Robotics ER 1.6 Preview on just 16.5%. When augmented with defensive prompting and in-context learning, Gemini Robotics ER 1.6 Preview and GPT 5.4 Mini reach abstention rates of 93.6% and 88.6%, respectively, substantially mitigating their tendency toward indiscriminate affirmation, although no approach fully solves the problem.
πŸ“ Abstract
Vision-language models (VLMs) are used as high-level planners for embodied agents, translating natural language instructions and visual observations into action plans. While prior work has studied abstention in LLMs, existing benchmarks are largely text-only and do not capture the perceptual grounding and physical constraints inherent to embodied robotics environments. In such settings, abstention requires recognizing when instructions are ambiguous, physically infeasible, based on false premises, or otherwise unresolvable given the available sensory modalities and context. To address this gap, we introduce a taxonomy to categorize abstention in the context of embodied robotics and present RoboAbstention, a scalable and auditable framework for generating abstention instructions grounded in images gathered from five robotics datasets. RoboAbstention instantiates the taxonomy through a three-phase pipeline: (1) structured visual grounding, (2) deterministic constraint derivation, and (3) controlled instruction generation via category-specific templates. This enables the construction of a diverse dataset with verifiable abstention conditions. We evaluate several frontier VLMs and find that all models exhibit significant weaknesses in abstention, including those with advanced reasoning capabilities. The best-performing model, Gemini 2.5 Flash, abstains on only 39.0% of our 6,069 benchmark instructions, while the embodied planner Gemini Robotics ER 1.6 Preview abstains on just 16.5%. We further explore methods for improving abstention in VLM planners, such as defensive prompting and in-context learning, and find that these interventions substantially improve performance, reaching a 93.6% abstention rate for Gemini Robotics ER 1.6 Preview and 88.6% for GPT 5.4 Mini, yet no approach fully solves the problem. We open-source RoboAbstention at https://purseclab.github.io/RoboAbstention/.
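To make the three-phase pipeline and the abstention-rate metric more concrete, the following is a minimal Python sketch and not the authors' implementation. Everything in it is an assumption made for illustration: the `SceneObject` type and its `graspable` attribute, the three category names (taken loosely from the abstract's wording), the template strings, and the rules used to derive each condition. Phase 1 is replaced by a hand-written scene description, since the abstract does not say how visual grounding is performed.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    name: str
    graspable: bool  # hypothetical attribute, not taken from the paper


# Hypothetical templates, one per abstention category named in the abstract.
TEMPLATES = {
    "ambiguous": "Pick up the {obj}.",
    "infeasible": "Lift the {obj} and carry it to the shelf.",
    "false_premise": "Hand me the {obj} from the table.",
}


def derive_constraints(objects, vocabulary):
    """Phase 2 (assumed rules): compute abstention conditions from the scene."""
    counts = Counter(o.name for o in objects)
    return {
        # Two or more objects share a name, so "the mug" has no unique referent.
        "ambiguous": sorted(n for n, c in counts.items() if c > 1),
        # The object exists but cannot be manipulated by the robot.
        "infeasible": sorted({o.name for o in objects if not o.graspable}),
        # The instruction would presuppose an object that is not in the scene.
        "false_premise": sorted(set(vocabulary) - set(counts)),
    }


def generate_instructions(constraints):
    """Phase 3: fill the category template for every derived condition."""
    return [
        (category, TEMPLATES[category].format(obj=name))
        for category in sorted(constraints)
        for name in constraints[category]
    ]


def abstention_rate(abstained):
    """Fraction of benchmark instructions on which the model abstained."""
    return sum(abstained) / len(abstained)


# Phase 1 stand-in: a hand-written structured grounding of one image.
scene = [
    SceneObject("mug", True),
    SceneObject("mug", True),
    SceneObject("table", False),
]
instructions = generate_instructions(derive_constraints(scene, ["mug", "banana"]))
for category, text in instructions:
    print(category, "->", text)
```

Because every instruction in this sketch is produced by a deterministic rule over the grounded scene, the reason a planner should abstain can be read back from the category label, which is one plausible reading of what the abstract means by "verifiable abstention conditions". Under the same assumption, a reported figure such as 16.5% would correspond to `abstention_rate` evaluated over per-instruction abstain/comply judgments for the full benchmark.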
Problem

Research questions and friction points this paper is trying to address.

abstention
embodied robotics
vision-language models
instruction following
physical feasibility
Innovation

Methods, ideas, or system contributions that make the work stand out.

abstention
embodied robotics
vision-language models
instruction grounding
RoboAbstention