VL-LN Bench: Towards Long-horizon Goal-oriented Navigation with Active Dialogs

πŸ“… 2025-12-26
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
In embodied navigation, user instructions are often ambiguous, requiring agents to ask clarifying questions to resolve intent. To address this, we propose the **Interactive Instance Object Navigation (IION)** task together with the Vision Language-Language Navigation (VL-LN) benchmark, which supports long-horizon navigation and active question-asking. The benchmark comprises over 41K automatically synthesized multi-turn dialogs and an evaluation oracle capable of responding to natural-language queries. Methodologically, we design a multimodal encoder integrating a ViT with LLMs and an explicit dialog policy module, trained with joint reinforcement and supervised learning, and further introduce automated prompt engineering to generate high-quality dialog data. Experiments demonstrate that our model significantly outperforms existing navigation baselines, validating that proactive dialog effectively mitigates instruction ambiguity and substantially improves success rate and robustness in long-horizon navigation.
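
To make the ask-or-move interaction pattern concrete, here is a minimal Python sketch of the control flow such an agent would need. All names below (`DialogState`, `should_ask`, `navigate`, `oracle`, etc.) are hypothetical placeholders standing in for the paper's ViT/LLM encoder and dialog policy; this is an illustration of the idea, not the authors' implementation or the VL-LN API.

```python
# Illustrative sketch only: every component here is a hypothetical stand-in.
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class DialogState:
    # Alternating (speaker, utterance) pairs accumulated during an episode.
    history: List[Tuple[str, str]] = field(default_factory=list)

def agent_step(
    observation: dict,
    instruction: str,
    dialog: DialogState,
    should_ask: Callable[[dict, str, DialogState], bool],
    ask: Callable[[dict, str, DialogState], str],
    navigate: Callable[[dict, str, DialogState], str],
    oracle: Callable[[str], str],
) -> dict:
    """One IION-style step: either query the oracle or emit a motion action."""
    if should_ask(observation, instruction, dialog):
        question = ask(observation, instruction, dialog)
        answer = oracle(question)          # oracle replies in natural language
        dialog.history += [("agent", question), ("oracle", answer)]
        return {"type": "ask", "question": question, "answer": answer}
    # Otherwise emit a low-level navigation action (e.g. forward / turn / stop).
    return {"type": "move", "action": navigate(observation, instruction, dialog)}

# Toy usage with dummy components, just to show the control flow.
dialog = DialogState()
out = agent_step(
    observation={"rgb": None},
    instruction="Find the chair.",
    dialog=dialog,
    should_ask=lambda o, i, d: len(d.history) == 0,   # ask once at the start
    ask=lambda o, i, d: "Which room is the chair in?",
    navigate=lambda o, i, d: "move_forward",
    oracle=lambda q: "It is in the living room.",
)
print(out)  # {'type': 'ask', ...}; a second call would return a move action
```

The key point is that the policy's output space contains both language and motion, so asking a question is itself an action the agent can take mid-episode.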

πŸ“ Abstract
In most existing embodied navigation tasks, instructions are well-defined and unambiguous, such as instruction following and object searching. Under this idealized setting, agents are required solely to produce effective navigation outputs conditioned on vision and language inputs. However, real-world navigation instructions are often vague and ambiguous, requiring the agent to resolve uncertainty and infer user intent through active dialog. To address this gap, we propose Interactive Instance Object Navigation (IION), a task that requires agents not only to generate navigation actions but also to produce language outputs via active dialog, thereby aligning more closely with practical settings. IION extends Instance Object Navigation (ION) by allowing agents to freely consult an oracle in natural language while navigating. Building on this task, we present the Vision Language-Language Navigation (VL-LN) benchmark, which provides a large-scale, automatically generated dataset and a comprehensive evaluation protocol for training and assessing dialog-enabled navigation models. VL-LN comprises over 41k long-horizon dialog-augmented trajectories for training and an automatic evaluation protocol with an oracle capable of responding to agent queries. Using this benchmark, we train a navigation model equipped with dialog capabilities and show that it achieves significant improvements over the baselines. Extensive experiments and analyses further demonstrate the effectiveness and reliability of VL-LN for advancing research on dialog-enabled embodied navigation. Code and dataset: https://0309hws.github.io/VL-LN.github.io/
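
As a rough illustration of the oracle-based evaluation protocol described in the abstract, one episode might run like the sketch below. `env`, `agent`, and `oracle` are assumed placeholder interfaces (including `env.instruction`, `env.success_radius`, and `oracle.respond`), not the released VL-LN code; success here is judged by whether the agent stops close enough to the correct target instance.

```python
# Hypothetical evaluation loop for one IION episode; all interfaces are placeholders.
def run_episode(env, agent, oracle, max_steps=500):
    obs = env.reset()
    dialog = []                      # shared natural-language history
    for _ in range(max_steps):
        out = agent.act(obs, env.instruction, dialog)
        if out["type"] == "ask":
            # The agent may consult the oracle in free-form language at any time.
            answer = oracle.respond(out["question"])
            dialog += [("agent", out["question"]), ("oracle", answer)]
            continue
        obs, done = env.step(out["action"])
        if done:                     # agent issued a stop action
            break
    # Episode succeeds if the agent stopped near the correct target instance.
    return env.distance_to_target() <= env.success_radius
```
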
Problem

Research questions and friction points this paper is trying to address.

Existing navigation tasks assume well-defined, unambiguous instructions, unlike real-world requests
Agents must resolve uncertainty and infer user intent through active dialog, not only produce navigation actions
Prior benchmarks lack the data and evaluation protocol needed to train and assess dialog-enabled navigation models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces Interactive Instance Object Navigation with active dialog
Creates VL-LN benchmark with large-scale dialog-augmented trajectories
Trains a dialog-capable navigation model and evaluates it with an automatic oracle-based protocol
πŸ”Ž Similar Papers
No similar papers found.
Wensi Huang
University of Science and Technology of China
Shaohao Zhu
Shanghai AI Laboratory
Meng Wei
Shanghai AI Laboratory
Jinming Xu
Zhejiang University
Xihui Liu
University of Hong Kong, UC Berkeley, CUHK, Tsinghua University
Computer Vision, Deep Learning
Hanqing Wang
Shanghai AI Laboratory
Tai Wang
Shanghai AI Laboratory
Computer Vision, 3D Vision, Embodied AI, Deep Learning
Feng Zhao
University of Science and Technology of China
Jiangmiao Pang
Shanghai AI Laboratory