HA-VLN: A Benchmark for Human-Aware Navigation in Discrete-Continuous Environments with Dynamic Multi-Human Interactions, Real-World Validation, and an Open Leaderboard

📅 2025-03-18
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Current Vision-and-Language Navigation (VLN) systems are largely confined to either discrete or continuous paradigms, limiting their ability to handle social interactions in dynamic, multi-pedestrian environments. To address this, we propose the first unified framework that explicitly models human social intent while jointly leveraging discrete and continuous navigation. Our method introduces a discrete-continuous cooperative task formulation and incorporates personal-space constraints. We release HAPS 2.0, a large-scale human motion dataset, and an enhanced simulator; construct a human-centered instruction evaluation benchmark (16,844 instances); and validate real-world transfer in crowded physical settings. The approach integrates multi-agent simulation, motion-language alignment learning, partially observable reinforcement learning, and social-distance-aware planning. Experiments demonstrate significant improvements in navigation success rate and substantial reductions in collision frequency, confirming the critical role of social context modeling for safe navigation. All data, code, and evaluation tools are publicly released to advance standardization in human-centered VLN.
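The summary mentions personal-space constraints as a core ingredient of the task formulation. The paper's own thresholds and geometry are not given here, but the idea can be illustrated with a minimal sketch: count the waypoints along an agent's path that fall inside a (hypothetical) personal-space radius around any pedestrian. All names, coordinates, and the radius value below are illustrative assumptions, not the paper's implementation.

```python
import math

def personal_space_violations(agent_path, human_positions, radius=0.5):
    """Count waypoints where the agent enters any human's personal space.

    agent_path: list of (x, y) waypoints; human_positions: list of (x, y).
    radius: hypothetical personal-space radius in meters (the paper's
    actual threshold is not reproduced here).
    """
    violations = 0
    for ax, ay in agent_path:
        for hx, hy in human_positions:
            if math.hypot(ax - hx, ay - hy) < radius:
                violations += 1
                break  # count each waypoint at most once
    return violations

path = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0)]
humans = [(1.0, 0.3)]
print(personal_space_violations(path, humans))  # → 1 (only the last waypoint is within 0.5 m)
```

A planner enforcing the constraint would reject or penalize candidate paths for which this count is nonzero, which is one simple way to realize the social-distance-aware planning the summary describes.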

πŸ“ Abstract
Vision-and-Language Navigation (VLN) systems often focus on either discrete (panoramic) or continuous (free-motion) paradigms alone, overlooking the complexities of human-populated, dynamic environments. We introduce a unified Human-Aware VLN (HA-VLN) benchmark that merges these paradigms under explicit social-awareness constraints. Our contributions include:

1. A standardized task definition that balances discrete-continuous navigation with personal-space requirements;
2. An enhanced human motion dataset (HAPS 2.0) and upgraded simulators capturing realistic multi-human interactions, outdoor contexts, and refined motion-language alignment;
3. Extensive benchmarking on 16,844 human-centric instructions, revealing how multi-human dynamics and partial observability pose substantial challenges for leading VLN agents;
4. Real-world robot tests validating sim-to-real transfer in crowded indoor spaces; and
5. A public leaderboard supporting transparent comparisons across discrete and continuous tasks.

Empirical results show improved navigation success and fewer collisions when social context is integrated, underscoring the need for human-centric design. By releasing all datasets, simulators, agent code, and evaluation tools, we aim to advance safer, more capable, and socially responsible VLN research.
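The abstract reports navigation success and collision frequency as the headline metrics over the 16,844-instruction benchmark. As a hedged sketch of how such aggregate metrics are typically computed (not the paper's evaluation code; the episode schema below is an assumption), each episode can record a success flag and a collision count:

```python
# Illustrative sketch: aggregating success rate (SR) and collision
# rate (CR) over benchmark episodes. Field names are hypothetical.
def aggregate_metrics(episodes):
    n = len(episodes)
    success_rate = sum(e["success"] for e in episodes) / n
    collision_rate = sum(e["collisions"] > 0 for e in episodes) / n
    return {"SR": success_rate, "CR": collision_rate}

episodes = [
    {"success": True, "collisions": 0},
    {"success": False, "collisions": 2},
    {"success": True, "collisions": 1},
]
print(aggregate_metrics(episodes))  # SR = CR = 2/3
```

The claimed improvement, fewer collisions at a higher success rate when social context is modeled, would show up here as a joint increase in SR and decrease in CR between baseline and human-aware agents.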
Problem

Research questions and friction points this paper is trying to address.

Existing VLN systems treat discrete and continuous navigation in isolation, without human-aware constraints.
Dynamic, multi-human environments introduce partial observability and social-interaction challenges that current agents handle poorly.
Sim-to-real transfer of socially aware navigation in crowded indoor spaces remains largely unvalidated.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified benchmark merging discrete-continuous navigation paradigms
Enhanced human motion dataset capturing realistic interactions
Real-world robot tests validating sim-to-real transfer
Authors

Yifei Dong (KTH Royal Institute of Technology)
Fengyi Wu (unknown affiliation)
Qi He (University of Washington)
Heng Li (University of Washington)
Minghan Li (Galbot)
Zebang Cheng (Shenzhen University)
Yuxuan Zhou (University of Mannheim)
Jingdong Sun (Carnegie Mellon University)
Qi Dai (Microsoft Research)
Zhi-Qi Cheng (University of Washington)
Alexander G. Hauptmann (Carnegie Mellon University)