LangMap: A Hierarchical Benchmark for Open-Vocabulary Goal Navigation

📅 2026-02-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two gaps in open-vocabulary goal navigation: insufficient semantic alignment between language and environment, and the lack of multi-granularity evaluation benchmarks. It introduces HieraNav, a task that requires agents to follow natural language instructions to targets spanning four semantic levels (scene, room, region, and instance) in real 3D indoor environments. To support the task, the work presents LangMap, a large-scale benchmark dataset with human-verified, discriminative language descriptions that enable open-vocabulary navigation across all four granularities for the first time. Experiments show that LangMap's annotations achieve 23.8% higher discriminative accuracy than GOAT-Bench while using 75% fewer words (a quarter of the word count). Evaluation further shows that richer context and memory mechanisms improve success rates, yet challenges persist for long-tail categories, small or distant objects, and multi-target scenarios.

📝 Abstract

The relationships between objects and language are fundamental to meaningful communication between humans and AI, and to practically useful embodied intelligence. We introduce HieraNav, a multi-granularity, open-vocabulary goal navigation task where agents interpret natural language instructions to reach targets at four semantic levels: scene, room, region, and instance. To this end, we present Language as a Map (LangMap), a large-scale benchmark built on real-world 3D indoor scans with comprehensive human-verified annotations and tasks spanning these levels. LangMap provides region labels, discriminative region descriptions, discriminative instance descriptions covering 414 object categories, and over 18K navigation tasks. Each target features both concise and detailed descriptions, enabling evaluation across different instruction styles. LangMap achieves superior annotation quality, outperforming GOAT-Bench by 23.8% in discriminative accuracy using four times fewer words. Comprehensive evaluations of zero-shot and supervised models on LangMap reveal that richer context and memory improve success, while long-tailed, small, context-dependent, and distant goals, as well as multi-goal completion, remain challenging. HieraNav and LangMap establish a rigorous testbed for advancing language-driven embodied navigation. Project: https://bo-miao.github.io/LangMap
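
To make the task structure concrete, here is a minimal Python sketch of how a HieraNav-style task record and a per-level success rate might be represented. It is not the LangMap data format or the authors' evaluation code: the class names, field names, `success_rate_by_level` helper, and the sample descriptions are all hypothetical, chosen only to mirror the four semantic levels and the concise/detailed description pair mentioned in the abstract.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    """The four semantic levels named in the abstract."""
    SCENE = "scene"
    ROOM = "room"
    REGION = "region"
    INSTANCE = "instance"


@dataclass(frozen=True)
class NavTask:
    """Hypothetical task record; field names are illustrative only."""
    task_id: str
    level: Level
    concise: str   # short instruction style
    detailed: str  # longer instruction style


def success_rate_by_level(tasks, outcomes):
    """Return {Level: fraction of tasks at that level marked successful}.

    `outcomes` maps task_id -> bool; a task with no recorded outcome
    is counted as a failure.
    """
    totals = defaultdict(int)
    wins = defaultdict(int)
    for task in tasks:
        totals[task.level] += 1
        wins[task.level] += int(outcomes.get(task.task_id, False))
    return {level: wins[level] / totals[level] for level in totals}


# Invented sample data, not taken from the benchmark.
tasks = [
    NavTask("t1", Level.ROOM, "the kitchen",
            "the kitchen with a window above the sink"),
    NavTask("t2", Level.INSTANCE, "the red mug",
            "the red mug on the desk next to the monitor"),
    NavTask("t3", Level.INSTANCE, "the floor lamp",
            "the floor lamp beside the grey sofa"),
]
outcomes = {"t1": True, "t2": False, "t3": True}
rates = success_rate_by_level(tasks, outcomes)
```

Grouping results by level in this way is one simple means of exposing the kind of granularity-specific weaknesses the abstract describes, such as lower success on instance-level goals than on room-level goals.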

Problem

Research questions and friction points this paper is trying to address.

open-vocabulary goal navigation

embodied intelligence

natural language instructions

semantic levels

3D indoor navigation

Innovation

Methods, ideas, or system contributions that make the work stand out.

open-vocabulary navigation

hierarchical semantic grounding

language-driven embodied AI