Open-Vocabulary 3D Instruction Ambiguity Detection

📅 2026-01-09
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the critical safety risks posed by ambiguous language instructions in embodied interactions within 3D environments, where current 3D large language models generally lack the ability to assess instruction clarity. The authors introduce the first open-vocabulary 3D instruction ambiguity detection task, which requires models to determine whether a natural-language instruction has a unique and unambiguous interpretation given a 3D scene. To support this research direction, they construct Ambi3D, a large-scale benchmark comprising over 700 scenes and 22,000 instructions, and further propose AmbiVer, a two-stage discriminative framework that leverages multi-view visual evidence to guide reasoning about ambiguity. Experiments reveal that existing 3D foundation models perform poorly on this task, while AmbiVer significantly outperforms baseline approaches, laying a crucial foundation for developing safe and trustworthy embodied AI systems.
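The task itself reduces to a binary judgment over a (scene, instruction) pair. A minimal sketch of what that interface might look like, where `AmbiguityExample`, `AmbiguityDetector`, and the scene-identifier scheme are illustrative assumptions rather than the benchmark's actual format:

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class AmbiguityExample:
    """One hypothetical benchmark item: a 3D scene paired with an instruction."""
    scene_id: str       # identifier of a 3D scene (naming scheme assumed, not from the paper)
    instruction: str    # open-vocabulary natural-language command
    is_ambiguous: bool  # ground truth: True if the command has more than one valid reading


class AmbiguityDetector(Protocol):
    """Interface any evaluated model would implement for this task."""
    def predict(self, scene_id: str, instruction: str) -> bool:
        """Return True if the instruction is judged ambiguous in the given scene."""
        ...


def accuracy(model: AmbiguityDetector, data: list[AmbiguityExample]) -> float:
    """Fraction of items where the model's binary judgment matches the label."""
    hits = sum(model.predict(ex.scene_id, ex.instruction) == ex.is_ambiguous for ex in data)
    return hits / len(data)
```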

📝 Abstract
In safety-critical domains, linguistic ambiguity can have severe consequences; a vague command like "Pass me the vial" in a surgical setting could lead to catastrophic errors. Yet most embodied AI research overlooks this, assuming instructions are clear and focusing on execution rather than confirmation. To address this critical safety gap, we are the first to define Open-Vocabulary 3D Instruction Ambiguity Detection, a fundamental new task where a model must determine if a command has a single, unambiguous meaning within a given 3D scene. To support this research, we build Ambi3D, the first large-scale benchmark for this task, featuring over 700 diverse 3D scenes and around 22k instructions. Our analysis reveals a surprising limitation: state-of-the-art 3D Large Language Models (LLMs) struggle to reliably determine if an instruction is ambiguous. To address this challenge, we propose AmbiVer, a two-stage framework that collects explicit visual evidence from multiple views and uses it to guide a vision-language model (VLM) in judging instruction ambiguity. Extensive experiments demonstrate the challenge of our task and the effectiveness of AmbiVer, paving the way for safer and more trustworthy embodied AI. Code and dataset are available at https://jiayuding031020.github.io/ambi3d/.
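The abstract describes AmbiVer only at a high level: stage one gathers explicit visual evidence from multiple views, stage two hands that evidence to a VLM that makes the final ambiguity judgment. A sketch of that flow is below; the injected helpers (`find_candidates`, `vlm_ask`) and the YES/NO prompt are assumptions made for illustration, not the authors' implementation:

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Candidate:
    """One piece of visual evidence: a view plus a described candidate referent."""
    view_id: int
    caption: str  # short description of an object that might satisfy the instruction


def detect_ambiguity(
    instruction: str,
    views: Iterable["Image"],                              # multi-view renders of the 3D scene
    find_candidates: Callable[["Image", str], list[str]],  # hypothetical grounding step
    vlm_ask: Callable[[str], str],                         # hypothetical VLM query
) -> bool:
    """Return True if the instruction is judged ambiguous in the scene.

    Stage 1: scan every view for objects that could satisfy the instruction.
    Stage 2: hand the collected evidence to a VLM for the final judgment.
    """
    evidence = [
        Candidate(view_id, caption)
        for view_id, image in enumerate(views)
        for caption in find_candidates(image, instruction)
    ]
    prompt = (
        f"Instruction: {instruction}\n"
        f"Candidate referents: {[c.caption for c in evidence]}\n"
        "Does the instruction pick out exactly one candidate? Answer YES or NO."
    )
    # NO -> more than one surviving reading -> the instruction is ambiguous.
    return vlm_ask(prompt).strip().upper() == "NO"
```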
Problem

Research questions and friction points this paper is trying to address.

instruction ambiguity
open-vocabulary
3D scene understanding
embodied AI
safety-critical
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-Vocabulary 3D Instruction Ambiguity Detection
Ambi3D
AmbiVer
Vision-Language Model
Embodied AI Safety