VLNVerse: A Benchmark for Vision-Language Navigation with Versatile, Embodied, Realistic Simulation and Evaluation

📅 2025-12-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing Vision-and-Language Navigation (VLN) benchmarks suffer from limited scale, oversimplified physics simulation, and fragmented task definitions, hindering large-model pretraining and sim-to-real generalization research. To address these limitations, we introduce the first large-scale, extensible embodied VLN benchmark, built upon the high-fidelity Unity/PhysX physics engine and supporting full-kinematic agent simulation. We propose a unified General–Embodied–Realistic (GER) framework, enabling end-to-end cross-task generalization across diverse VLN benchmarks (e.g., R2R, CVDN, RxR) for the first time. Additionally, we develop an MLLM-driven navigation policy and an automated evaluation methodology. Comprehensive experiments demonstrate that our unified model achieves state-of-the-art generalization performance across all major VLN subtasks, bridging the critical gap in sim-to-real navigation evaluation and advancing embodied intelligence toward generality and scalability.

📝 Abstract
Despite remarkable progress in Vision-Language Navigation (VLN), existing benchmarks remain confined to fixed, small-scale datasets with naive physical simulation. These shortcomings limit the insight that the benchmarks provide into sim-to-real generalization, and create a significant research gap. Furthermore, task fragmentation prevents unified, shared progress in the area, while limited data scales fail to meet the demands of modern LLM-based pretraining. To overcome these limitations, we introduce VLNVerse: a new large-scale, extensible benchmark designed for Versatile, Embodied, Realistic Simulation, and Evaluation. VLNVerse redefines VLN as a scalable, full-stack embodied AI problem. Its Versatile nature unifies previously fragmented tasks into a single framework and provides an extensible toolkit for researchers. Its Embodied design moves beyond intangible, teleporting "ghost" agents to support full-kinematic agents in a Realistic Simulation powered by a robust physics engine. We leverage the scale and diversity of VLNVerse to conduct a comprehensive Evaluation of existing methods, from classic models to MLLM-based agents. We also propose a novel unified multi-task model capable of addressing all tasks within the benchmark. VLNVerse aims to narrow the gap between simulated navigation and real-world generalization, providing the community with a vital tool to boost research towards scalable, general-purpose embodied locomotion agents.
Problem

Research questions and friction points this paper is trying to address.

Addresses limitations of small-scale, fixed VLN datasets
Unifies fragmented tasks into a single extensible framework
Enhances realistic simulation for sim-to-real generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale extensible benchmark for VLN
Unified framework for previously fragmented tasks
Realistic simulation with full-kinematics physics engine
Sihao Lin
Postdoc, AIML, The University of Adelaide
Artificial intelligence · Pattern recognition · Vision-language model
Zerui Li
Adelaide University
Robotics · Computer Vision · Embodied AI
Xunyi Zhao
Adelaide University; Responsible AI Research Centre, Australian Institute for Machine Learning
Gengze Zhou
The University of Adelaide
Embodied AI · Multimodality
Liuyi Wang
Tongji University
Computer vision · Natural language processing · Artificial intelligence
Rong Wei
ManyCore
Rui Tang
ManyCore
Juncheng Li
East China Normal University
Super Resolution · Image Restoration · Computer Vision · Medical Image Analysis
Hanqing Wang
Shanghai AI Lab
Jiangmiao Pang
Shanghai AI Lab
Anton van den Hengel
Director of the Centre for Augmented Reasoning at AIML, and CommBank AI Scholar
Computer Vision · Machine Learning · Visual Question Answering
Jiajun Liu
Responsible AI Research Centre, Australian Institute for Machine Learning; CSIRO Data61
Qi Wu
Adelaide University; Responsible AI Research Centre, Australian Institute for Machine Learning