Enhancing Vision-Language Navigation with Multimodal Event Knowledge from Real-World Indoor Tour Videos

πŸ“… 2026-02-27
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the challenge of long-horizon reasoning in vision-and-language navigation (VLN) within unseen environments, where coarse-grained instructions often hinder effective decision-making. To this end, the authors propose constructing a multimodal spatiotemporal knowledge graph, termed YE-KG, and introduce a coarse-to-fine hierarchical retrieval mechanism that integrates structured event knowledge into the navigation model. They present the first approach to automatically mine semantic-action-effect triplets from real-world indoor videos, leveraging multimodal large language models such as LLaVA and GPT-4 to extract events and build a large-scale event knowledge graph. Additionally, an episodic-memory-inspired event-centric augmentation strategy is incorporated to enrich contextual understanding. The proposed method achieves state-of-the-art performance across the REVERIE, R2R, and R2R-CE benchmarks, significantly improving navigation accuracy under diverse action spaces.
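The summary above describes mining semantic-action-effect triplets from video and linking them into an event knowledge graph. As a minimal illustration of that idea (the schema and names below are hypothetical, not the paper's actual YE-KG format), a triplet can be modeled as a small record, and causal edges drawn wherever one event's effect matches the next event's semantic context:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EventTriplet:
    # Hypothetical schema: the actual YE-KG node/edge types are not given here.
    semantic: str  # scene/object context, e.g. "kitchen with open door"
    action: str    # agent action, e.g. "walk through the door"
    effect: str    # resulting state, e.g. "hallway"

def chain_events(triplets):
    """Link triplets into causal event sequences: add an edge a -> b
    whenever event a's effect matches event b's semantic context."""
    edges = []
    for a in triplets:
        for b in triplets:
            if a is not b and a.effect == b.semantic:
                edges.append((a, b))
    return edges

t1 = EventTriplet("kitchen with open door", "walk through the door", "hallway")
t2 = EventTriplet("hallway", "turn left", "living room")
print(len(chain_events([t1, t2])))  # 1 causal edge: t1 -> t2
```

Chained this way, the triplets form the kind of causal event sequence an agent could retrieve as "episodic memory" when following a multi-step instruction.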

πŸ“ Abstract
Vision-Language Navigation (VLN) agents often struggle with long-horizon reasoning in unseen environments, particularly when facing ambiguous, coarse-grained instructions. While recent advances use knowledge graphs to enhance reasoning, the potential of multimodal event knowledge inspired by human episodic memory remains underexplored. In this work, we propose an event-centric knowledge enhancement strategy for automated process-knowledge mining and feature fusion to address coarse-grained instructions and long-horizon reasoning in the VLN task. First, we construct YE-KG, the first large-scale multimodal spatiotemporal knowledge graph, with over 86k nodes and 83k edges, derived from real-world indoor videos. By leveraging multimodal large language models (i.e., LLaVA, GPT-4), we distill unstructured video streams into structured semantic-action-effect events that serve as explicit episodic memory. Second, we introduce STE-VLN, which integrates this graph into VLN models via a coarse-to-fine hierarchical retrieval mechanism. This allows agents to retrieve causal event sequences and dynamically fuse them with egocentric visual observations. Experiments on the REVERIE, R2R, and R2R-CE benchmarks demonstrate the effectiveness of our event-centric strategy, which outperforms state-of-the-art approaches across diverse action spaces. Our data and code are available on the project website https://sites.google.com/view/y-event-kg/.
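The abstract's coarse-to-fine hierarchical retrieval can be sketched in miniature. The sketch below is an assumption about the general shape of such a mechanism, not the paper's implementation: a coarse stage filters stored events by scene label against the instruction, and a fine stage ranks the survivors by lexical similarity (a real system would use learned embeddings instead).

```python
import math
from collections import Counter

def tokens(text):
    """Bag-of-words token counts (stand-in for a learned text encoder)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def coarse_to_fine_retrieve(instruction, events, scene_of, top_k=2):
    """Coarse stage: keep events whose scene label occurs in the instruction.
    Fine stage: rank the surviving events by similarity to the instruction."""
    coarse = [e for e in events if scene_of(e).lower() in instruction.lower()]
    pool = coarse or events  # fall back to all events if the coarse filter is empty
    inst = tokens(instruction)
    ranked = sorted(pool, key=lambda e: cosine(inst, tokens(e)), reverse=True)
    return ranked[:top_k]

# Toy event store: "scene: event description" strings (hypothetical data).
events = [
    "kitchen: open the fridge then close it",
    "hallway: walk to the end and turn left",
    "bedroom: pick up the pillow",
]
scene_of = lambda e: e.split(":")[0]
print(coarse_to_fine_retrieve("go to the hallway and turn left", events, scene_of, top_k=1))
```

The retrieved event sequence would then be fused with the agent's egocentric observations before action prediction; the fusion step is model-specific and omitted here.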
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Navigation
long-horizon reasoning
coarse-grained instructions
multimodal event knowledge
unseen environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

event-centric knowledge
multimodal knowledge graph
vision-language navigation
episodic memory
hierarchical retrieval
Haoxuan Xu
Beihang University
computer vision
Tianfu Li
The Hong Kong University of Science and Technology (Guangzhou), Guangzhou 511453, China
Wenbo Chen
The Hong Kong University of Science and Technology (Guangzhou), Guangzhou 511453, China
Yi Liu
ζΈ…εŽε€§ε­¦
ζœΊε™¨δΊΊθ§†θ§‰ SLAM
Xingxing Zuo
Assistant Professor @MBZUAI
Robotics, State Estimation, Embodied AI
Yaoxian Song
Hangzhou City University, Hangzhou 310015, China
Haoang Li
Assistant Professor, Hong Kong University of Science and Technology (Guangzhou)
Robotics, 3D Computer Vision