GROKE: Vision-Free Navigation Instruction Evaluation via Graph Reasoning on OpenStreetMap

2026-01-12
Citations: 0
Influential: 0
AI Summary
Existing methods for evaluating navigation instructions either rely on reference texts, which poorly capture whether an instruction actually guides a navigator to the destination, or require costly visual simulations that are susceptible to perceptual errors. This work proposes a vision-free, training-free hierarchical large language model (LLM) evaluation framework that leverages OpenStreetMap's topological structure and landmark information. By encoding spatial knowledge into structured JSON and text representations and combining sub-instruction planning with graph-based reasoning, the framework enables interpretable and scalable functional assessment. On the Map2Seq dataset, the approach reduces navigation error by 68.5% relative to heuristic and sampling-based baselines. The agent's execution success rate, trajectory fidelity, and decision patterns serve as proxy metrics for instruction quality.
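
To make the idea of a structured JSON representation concrete, here is a minimal sketch of how a small street graph with landmarks could be serialized for an LLM prompt. It is not the paper's schema: the node ids, coordinates, landmark entry, field names, and the `encode_graph` helper are all hypothetical and chosen only for illustration.

```python
import json

# Hypothetical toy street graph. Node ids, coordinates, and the landmark are
# invented for illustration; they are not taken from OpenStreetMap or Map2Seq.
NODES = {
    "n1": {"lat": 40.7410, "lon": -73.9897},
    "n2": {"lat": 40.7418, "lon": -73.9891},
    "n3": {"lat": 40.7426, "lon": -73.9885},
}
EDGES = [("n1", "n2"), ("n2", "n3")]
LANDMARKS = [{"name": "Cafe", "type": "amenity", "near": "n2"}]


def encode_graph(nodes, edges, landmarks):
    """Serialize an undirected graph plus landmarks as a JSON string.

    The field names ("nodes", "neighbors", "landmarks") are assumptions,
    not the representation used by GROKE.
    """
    neighbors = {node_id: [] for node_id in nodes}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    payload = {
        "nodes": [
            {"id": node_id, **coords, "neighbors": sorted(neighbors[node_id])}
            for node_id, coords in sorted(nodes.items())
        ],
        "landmarks": landmarks,
    }
    return json.dumps(payload, indent=2)


if __name__ == "__main__":
    print(encode_graph(NODES, EDGES, LANDMARKS))
```

An adjacency-list layout like this keeps each node's reachable neighbors explicit, which is one plausible way to expose topology to a text-only model.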

๐Ÿ“ Abstract
The evaluation of navigation instructions remains a persistent challenge in Vision-and-Language Navigation (VLN) research. Traditional reference-based metrics such as BLEU and ROUGE fail to capture the functional utility of spatial directives, specifically whether an instruction successfully guides a navigator to the intended destination. Although existing VLN agents could serve as evaluators, their reliance on high-fidelity visual simulators introduces licensing constraints and computational costs, and perception errors further confound linguistic quality assessment. This paper introduces GROKE (Graph-based Reasoning over OSM Knowledge for instruction Evaluation), a vision-free, training-free, hierarchical LLM-based framework for evaluating navigation instructions using OpenStreetMap data. Through systematic ablation studies, we demonstrate that structured JSON and textual formats for spatial information substantially outperform grid-based and visual graph representations. Our hierarchical architecture combines sub-instruction planning with topological graph navigation, reducing navigation error by 68.5% compared to heuristic and sampling baselines on the Map2Seq dataset. The agent's execution success, trajectory fidelity, and decision patterns serve as proxy metrics for functional navigability given OSM-visible landmarks and topology, establishing a scalable and interpretable evaluation paradigm without visual dependencies. Code and data are available at https://anonymous.4open.science/r/groke.
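
The abstract does not define how navigation error is computed. The sketch below assumes the convention common in VLN work, where navigation error is the shortest-path distance between the node where the agent stops and the goal node, and shows how a relative reduction such as the reported 68.5% would be derived from two error values. The graph, edge lengths, and function names are hypothetical, and the numbers in the usage lines are not results from the paper.

```python
import heapq

# Hypothetical weighted street graph: node names and edge lengths (metres)
# are invented for illustration only.
GRAPH = {
    "a": {"b": 50.0},
    "b": {"a": 50.0, "c": 70.0},
    "c": {"b": 70.0, "d": 30.0},
    "d": {"c": 30.0},
}


def shortest_path_length(graph, source, target):
    """Dijkstra's algorithm; returns the distance, or inf if unreachable."""
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, weight in graph[node].items():
            candidate = dist + weight
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return float("inf")


def navigation_error(graph, stop_node, goal_node):
    """Assumed definition: graph distance from the stopping node to the goal."""
    return shortest_path_length(graph, stop_node, goal_node)


def relative_reduction(baseline_error, method_error):
    """Percentage by which method_error is lower than baseline_error."""
    return 100.0 * (baseline_error - method_error) / baseline_error


if __name__ == "__main__":
    # Illustrative values only, not figures from the paper.
    print(navigation_error(GRAPH, "b", "d"))
    print(relative_reduction(200.0, 50.0))
```

Under this reading, an instruction on which the agent stops far from the goal yields a large error, so a lower average error indicates instructions that are easier to follow from map-visible information alone.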
Problem

Research questions and friction points this paper is trying to address.

Vision-and-Language Navigation
navigation instruction evaluation
functional utility
OpenStreetMap
reference-based metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-free navigation
graph reasoning
OpenStreetMap
instruction evaluation
hierarchical LLM