🤖 AI Summary
Current Vision-and-Language Navigation (VLN) systems are largely confined to either discrete or continuous paradigms, limiting their ability to handle social interactions in dynamic, multi-pedestrian environments. To address this, we propose the first unified framework that explicitly models human social intent while jointly leveraging discrete and continuous navigation. Our method introduces a discrete-continuous cooperative task formulation and incorporates personal-space constraints. We release HAPS 2.0, a large-scale human motion dataset, together with an enhanced simulator; construct a human-centered instruction evaluation benchmark (16,844 instances); and validate real-world transfer in crowded physical settings. The approach integrates multi-agent simulation, motion-language alignment learning, partially observable reinforcement learning, and social-distance-aware planning. Experiments demonstrate significant improvements in navigation success rate and substantial reductions in collision frequency, confirming the critical role of social-context modeling for safe navigation. All data, code, and evaluation tools are publicly released to advance standardization in human-centered VLN.
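To make the personal-space constraint concrete, here is a minimal sketch of how such a check might be wired into per-step reward shaping. This is an assumption-laden illustration, not the paper's implementation: the `Pose2D` type, function names, the 1.0 m radius, and the penalty value are all hypothetical; HA-VLN defines its own collision and personal-space criteria.

```python
import math
from dataclasses import dataclass

# Hypothetical sketch: agent and pedestrians tracked as 2D positions (meters).
# Radius and penalty values are illustrative, not from the paper.

@dataclass
class Pose2D:
    x: float
    y: float

def violates_personal_space(agent: Pose2D,
                            pedestrians: list[Pose2D],
                            radius_m: float = 1.0) -> bool:
    """Return True if the agent intrudes on any pedestrian's personal space."""
    return any(math.hypot(agent.x - p.x, agent.y - p.y) < radius_m
               for p in pedestrians)

def shaped_step_reward(agent: Pose2D, pedestrians: list[Pose2D],
                       base_reward: float, penalty: float = -1.0) -> float:
    """Add a social-distance penalty to the per-step navigation reward."""
    if violates_personal_space(agent, pedestrians):
        return base_reward + penalty
    return base_reward
```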
📝 Abstract
Vision-and-Language Navigation (VLN) systems often focus on either discrete (panoramic) or continuous (free-motion) paradigms alone, overlooking the complexities of human-populated, dynamic environments. We introduce a unified Human-Aware VLN (HA-VLN) benchmark that merges these paradigms under explicit social-awareness constraints. Our contributions include:

1. A standardized task definition that balances discrete-continuous navigation with personal-space requirements;
2. An enhanced human motion dataset (HAPS 2.0) and upgraded simulators capturing realistic multi-human interactions, outdoor contexts, and refined motion-language alignment;
3. Extensive benchmarking on 16,844 human-centric instructions, revealing how multi-human dynamics and partial observability pose substantial challenges for leading VLN agents;
4. Real-world robot tests validating sim-to-real transfer in crowded indoor spaces; and
5. A public leaderboard supporting transparent comparisons across discrete and continuous tasks.

Empirical results show improved navigation success and fewer collisions when social context is integrated, underscoring the need for human-centric design. By releasing all datasets, simulators, agent code, and evaluation tools, we aim to advance safer, more capable, and socially responsible VLN research.
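As a rough illustration of how the reported success and collision numbers could be computed from episode logs, a minimal sketch follows. The `Episode` fields and the 3.0 m success threshold are assumptions drawn from common VLN practice, not the benchmark's published metric definitions.

```python
from dataclasses import dataclass

# Illustrative metric computation over per-episode logs. Field names and the
# success threshold are assumptions; consult the HA-VLN evaluation tools for
# the authoritative definitions.

@dataclass
class Episode:
    final_goal_distance_m: float  # distance to goal when the episode ends
    collisions: int               # number of human collisions during the episode

def success_rate(episodes: list[Episode], threshold_m: float = 3.0) -> float:
    """Fraction of episodes ending within threshold_m of the goal."""
    return sum(e.final_goal_distance_m <= threshold_m for e in episodes) / len(episodes)

def collision_rate(episodes: list[Episode]) -> float:
    """Fraction of episodes with at least one human collision."""
    return sum(e.collisions > 0 for e in episodes) / len(episodes)
```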