🤖 AI Summary
Existing semantic navigation benchmarks lack fine-grained evaluation of language understanding—particularly the agent’s ability to accurately ground linguistic elements (e.g., attributes, spatial relations, and category hierarchies) across varying levels of descriptive granularity. To address this gap, we introduce LangNavBench, the first language-centric, open-set evaluation benchmark for semantic navigation. We further propose the Multi-Layered Feature Map (MLFM), a queryable, hierarchical semantic mapping framework that explicitly models fine-grained language semantics by combining large vision-language models with semantic mapping and natural language grounding. Experiments show that MLFM significantly outperforms state-of-the-art map-based navigation methods on LangNavBench, with substantial gains in navigation accuracy—especially for small-object localization—and in consistency between navigation behavior and linguistic intent.
📝 Abstract
Recent progress in large vision-language models has driven improvements in language-based semantic navigation, where an embodied agent must reach a target object described in natural language. Despite these advances, we still lack a clear, language-focused benchmark for testing how well such agents ground the words in their instructions. We address this gap with LangNav, an open-set dataset specifically created to test an agent's ability to locate objects described at different levels of detail, from broad category names to fine attributes and object-object relations. Every description in LangNav was manually checked, yielding a lower error rate than existing lifelong- and semantic-navigation datasets. On top of LangNav we build LangNavBench, a benchmark that measures how well current semantic-navigation methods understand and act on these descriptions while moving toward their targets. LangNavBench allows us to systematically compare models on their handling of attributes, spatial and relational cues, and category hierarchies, offering the first thorough, language-centric evaluation of embodied navigation systems. We also present the Multi-Layered Feature Map (MLFM), a method that builds a queryable multi-layered semantic map and is particularly effective for small objects or instructions involving spatial relations. MLFM outperforms state-of-the-art mapping-based navigation baselines on the LangNav dataset.
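To make the idea of a "queryable multi-layered semantic map" concrete, here is a minimal, purely illustrative sketch (not the paper's implementation). It assumes each map layer is a 2-D grid of feature vectors in the same embedding space as a text query, and grounding a description amounts to finding the cell with the highest cosine similarity across layers; all names (`MultiLayerMap`, the layer keys, the feature dimension) are hypothetical.

```python
import numpy as np

class MultiLayerMap:
    """Toy multi-layered semantic map: one feature grid per semantic layer.

    Hypothetical sketch, not the MLFM implementation from the paper.
    Each layer maps a name (e.g. "objects", "attributes") to an
    (H, W, D) array of per-cell feature vectors.
    """

    def __init__(self, layers):
        self.layers = layers  # dict: layer name -> (H, W, D) grid

    def query(self, text_embedding):
        """Return (layer, (row, col), score) of the best-matching cell."""
        best = None
        for name, grid in self.layers.items():
            h, w, d = grid.shape
            flat = grid.reshape(h * w, d)
            # Cosine similarity between the query and every map cell.
            sims = flat @ text_embedding / (
                np.linalg.norm(flat, axis=1)
                * np.linalg.norm(text_embedding)
                + 1e-8
            )
            idx = int(np.argmax(sims))
            if best is None or sims[idx] > best[2]:
                best = (name, (idx // w, idx % w), float(sims[idx]))
        return best

# Toy demo: random features plus one planted cell matching the query.
rng = np.random.default_rng(0)
query = rng.normal(size=64)           # stand-in for a text embedding
grid = rng.normal(size=(4, 4, 64))    # stand-in for map features
grid[2, 3] = query                    # plant the described object here
m = MultiLayerMap({"objects": grid})
layer, cell, score = m.query(query)
```

In a real system the grids would be populated from vision-language-model features projected onto the map, and the query embedding would come from the instruction's text encoder; the lookup itself stays this simple.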