🤖 AI Summary
Existing Vision-and-Language Navigation (VLN) benchmarks suffer from limited scale, oversimplified physics simulation, and fragmented task definitions, hindering large-model pretraining and sim-to-real generalization research. To address these limitations, we introduce VLNVerse, the first large-scale, extensible embodied VLN benchmark, built upon the high-fidelity Unity/PhysX physics engine and supporting full-kinematics agent simulation. We propose a unified General–Embodied–Realistic (GER) framework that, for the first time, enables end-to-end cross-task generalization across diverse VLN benchmarks (e.g., R2R, CVDN, RxR). Additionally, we develop an MLLM-driven navigation policy and an automated evaluation methodology. Comprehensive experiments show that our unified model achieves state-of-the-art generalization performance across all major VLN subtasks, bridging a critical gap in sim-to-real navigation evaluation and advancing embodied intelligence toward generality and scalability.
📝 Abstract
Despite remarkable progress in Vision-Language Navigation (VLN), existing benchmarks remain confined to fixed, small-scale datasets with naive physical simulation. These shortcomings limit the insight such benchmarks provide into sim-to-real generalization, creating a significant research gap. Furthermore, task fragmentation prevents unified progress in the area, while limited data scale fails to meet the demands of modern LLM-based pretraining. To overcome these limitations, we introduce VLNVerse: a new large-scale, extensible benchmark designed for Versatile, Embodied, Realistic Simulation, and Evaluation. VLNVerse redefines VLN as a scalable, full-stack embodied AI problem. Its Versatile nature unifies previously fragmented tasks into a single framework and provides an extensible toolkit for researchers. Its Embodied design moves beyond intangible, teleporting "ghost" agents to support full-kinematics agents in a Realistic Simulation powered by a robust physics engine. We leverage the scale and diversity of VLNVerse to conduct a comprehensive Evaluation of existing methods, from classic models to MLLM-based agents. We also propose a novel unified multi-task model capable of addressing all tasks within the benchmark. VLNVerse aims to narrow the gap between simulated navigation and real-world generalization, providing the community with a vital tool to boost research toward scalable, general-purpose embodied navigation agents.