Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation

📅 2026-03-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language models struggle to accurately execute complex language instructions in 3D environments that involve both semantic and metric constraints. This work proposes MAPG, a framework that structurally decomposes instructions and employs a multi-agent system to collaboratively invoke vision-language models for separate semantic and metric grounding, then probabilistically fuses the results into consistent, executable action decisions. The study introduces, for the first time, a multi-agent probabilistic fusion mechanism for the joint metric-semantic grounding problem; it also constructs MAPG-Bench, the first dedicated benchmark for this task, and demonstrates significant performance gains over strong baselines on HM-EQA. The method's effectiveness is further validated on both simulated and real-world robotic platforms, achieving successful cross-domain transfer.
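As a rough illustration of the decomposition step described above (not the authors' implementation), an instruction could be split into separately groundable parts: a semantic reference, a spatial relation, and a metric constraint. The class and field names below are hypothetical, and the toy rule-based parser stands in for the VLM-driven decomposition:

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class GroundingQuery:
    """Hypothetical structured decomposition of a metric-semantic instruction."""
    semantic_reference: str               # object to ground semantically, e.g. "fridge"
    spatial_relation: Optional[str]       # e.g. "right of"
    metric_constraint_m: Optional[float]  # distance in meters, e.g. 2.0

_WORD_NUM = {"one": 1.0, "two": 2.0, "three": 3.0}

def decompose(instruction: str) -> GroundingQuery:
    # Toy parser covering only the "<N> meters to the <rel> of the <obj>" pattern;
    # MAPG itself queries a model rather than using regexes.
    m = re.search(
        r"(\d+(?:\.\d+)?|one|two|three)\s+meters?\s+to\s+the\s+(\w+)\s+of\s+the\s+(\w+)",
        instruction.lower())
    if m is None:
        # No metric pattern recognized: treat the whole query as a semantic reference.
        return GroundingQuery(instruction, None, None)
    dist = _WORD_NUM.get(m.group(1)) or float(m.group(1))
    return GroundingQuery(semantic_reference=m.group(3),
                          spatial_relation=f"{m.group(2)} of",
                          metric_constraint_m=dist)

q = decompose("Go two meters to the right of the fridge")
# q.semantic_reference == "fridge", q.spatial_relation == "right of",
# q.metric_constraint_m == 2.0
```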

📝 Abstract
Robots collaborating with humans must convert natural language goals into actionable, physically grounded decisions. For example, executing a command such as "go two meters to the right of the fridge" requires grounding semantic references, spatial relations, and metric constraints within a 3D scene. While recent vision-language models (VLMs) demonstrate strong semantic grounding capabilities, they are not explicitly designed to reason about metric constraints in physically defined spaces. In this work, we empirically demonstrate that state-of-the-art VLM-based grounding approaches struggle with complex metric-semantic language queries. To address this limitation, we propose MAPG (Multi-Agent Probabilistic Grounding), an agentic framework that decomposes language queries into structured subcomponents and queries a VLM to ground each component. MAPG then probabilistically composes these grounded outputs to produce metrically consistent, actionable decisions in 3D space. We evaluate MAPG on the HM-EQA benchmark and show consistent performance improvements over strong baselines. Furthermore, we introduce a new benchmark, MAPG-Bench, specifically designed to evaluate metric-semantic goal grounding, addressing a gap in existing language grounding evaluations. We also present a real-world robot demonstration showing that MAPG transfers beyond simulation when a structured scene representation is available.
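One way to picture the probabilistic composition the abstract describes is as fusing two belief maps over a 2D grid: a semantic belief over where the referenced object is, and a metric likelihood over cells satisfying the distance constraint relative to it. The sketch below is a minimal toy version under those assumptions; the grid size, cell resolution, and Gaussian parameters are all hypothetical and not taken from the paper:

```python
import numpy as np

def fuse_grounding(grid_size=(50, 50), cell_m=0.2,
                   fridge_cell=(25, 10), offset_m=2.0, sigma_m=0.3):
    """Toy metric-semantic fusion on a 2D grid (all parameters hypothetical).

    1. Semantic grounding: a soft belief over where the fridge is.
    2. Metric grounding: a Gaussian likelihood over cells ~offset_m
       "to the right" (+x here) of the belief-weighted anchor.
    3. Fusion: normalize the product and act on the argmax cell.
    """
    h, w = grid_size
    ys, xs = np.mgrid[0:h, 0:w]

    # Semantic belief: peaked at the detected fridge cell (stand-in for a VLM score map).
    sem = np.exp(-(((ys - fridge_cell[0]) ** 2 + (xs - fridge_cell[1]) ** 2)
                   * cell_m ** 2) / (2 * 0.2 ** 2))
    sem /= sem.sum()

    # Anchor = belief-weighted mean position of the semantic map.
    anchor_y = (sem * ys).sum()
    anchor_x = (sem * xs).sum()

    # Metric likelihood: Gaussian around the point offset_m to the right of the anchor.
    target_x = anchor_x + offset_m / cell_m
    dist2 = ((ys - anchor_y) ** 2 + (xs - target_x) ** 2) * cell_m ** 2
    metric = np.exp(-dist2 / (2 * sigma_m ** 2))

    posterior = metric / metric.sum()
    goal = np.unravel_index(np.argmax(posterior), posterior.shape)
    return goal  # (row, col) cell to navigate toward
```

With the defaults above, the anchor sits at cell (25, 10) and the 2 m offset spans 10 cells, so the returned goal is the cell 10 columns to the right of the fridge.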
Problem

Research questions and friction points this paper is trying to address.

vision-language navigation
metric-semantic grounding
probabilistic grounding
3D scene understanding
language-to-action
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Agent Probabilistic Grounding
Vision-Language Navigation
Metric-Semantic Grounding
Structured Language Decomposition
3D Actionable Decision
Swagat Padhan
Arizona State University
Lakshya Jain
Arizona State University
Bhavya Minesh Shah
Arizona State University
Omkar Patil
Arizona State University
Thao Nguyen
Assistant Professor, Haverford College
Robotics · Human-Robot Interaction · Artificial Intelligence · Natural Language Processing
Nakul Gopalan
Assistant Professor, Arizona State University
Robotics · Natural Language · Reinforcement Learning