SleepWalk: A Three-Tier Benchmark for Stress-Testing Instruction-Guided Vision-Language Navigation

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current vision-language models struggle to translate natural-language instructions into spatially consistent, executable trajectories in 3D environments, and they fail systematically in scenarios involving local interactions, occlusions, and multi-step commands. To address this limitation, this work introduces SleepWalk, a three-tier, progressively harder benchmark designed to evaluate local interaction capabilities. SleepWalk combines navigable 3D scenes generated from text, rendered visual observations, and a standardized pointwise judge-based evaluation protocol to enable fine-grained assessment of spatial language grounding. Experiments across 2,472 scenes and 22,248 instructions (nine per scene) show that the performance of frontier models drops as task difficulty increases, underscoring the benchmark's value as a stress test for embodied reasoning.

📝 Abstract

Vision-Language Models (VLMs) have advanced rapidly in multimodal perception and language understanding, yet it remains unclear whether they can reliably ground language into spatially coherent, plausibly executable actions in 3D digital environments. We introduce SleepWalk, a benchmark for evaluating instruction-grounded trajectory prediction in single-scene 3D worlds generated from textual scene descriptions and filtered for navigability. Unlike prior navigation benchmarks centered on long-range exploration across rooms, SleepWalk targets localized, interaction-centric embodied reasoning: given rendered visual observations and a natural-language instruction, a model must predict a trajectory that respects scene geometry, avoids collisions, and terminates at an action-compatible location. The benchmark covers diverse indoor and outdoor environments and organizes tasks into three tiers of spatial and temporal difficulty, enabling fine-grained analysis of grounding under increasing compositional complexity. Using a standardized pointwise judge-based evaluation protocol, we evaluate three frontier VLMs on 2,472 curated 3D environments with nine instructions per scene. Results reveal systematic failures in grounded spatial reasoning, especially under occlusion, interaction constraints, and multi-step instructions: performance drops as task difficulty increases. Overall, current VLMs only partially succeed at producing trajectories that are simultaneously spatially coherent, plausibly executable, and aligned with intended actions. By exposing failures in a controlled yet scalable setting, SleepWalk provides a critical benchmark for advancing grounded multimodal reasoning, embodied planning, vision-language navigation, and action-capable agents in 3D environments.
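
To make the task requirements in the abstract concrete, the sketch below checks a predicted trajectory for two of the stated properties: avoiding collisions and terminating near a target location. It is an illustration only, not the benchmark's protocol, which the abstract describes as judge-based. The 2D occupancy grid, the `check_trajectory` helper, the `goal_radius` and `step` parameters, and the sample scene are all hypothetical; only the scene and instruction counts come from the abstract.

```python
from dataclasses import dataclass
from math import hypot

# Figures stated in the abstract: 2,472 scenes, nine instructions each.
NUM_SCENES = 2472
INSTRUCTIONS_PER_SCENE = 9
TOTAL_INSTRUCTIONS = NUM_SCENES * INSTRUCTIONS_PER_SCENE  # 22,248

# Hypothetical 2D stand-in for a 3D scene: 1 = obstacle, 0 = free cell.
Grid = list[list[int]]
Point = tuple[float, float]  # (x, y) in grid units


@dataclass
class TrajectoryCheck:
    collision_free: bool
    reaches_goal: bool

    @property
    def passed(self) -> bool:
        return self.collision_free and self.reaches_goal


def is_free(grid: Grid, p: Point) -> bool:
    """Return True if p lies inside the grid and on a free cell."""
    x, y = p
    if x < 0 or y < 0:
        return False
    col, row = int(x), int(y)
    if row >= len(grid) or col >= len(grid[0]):
        return False
    return grid[row][col] == 0


def check_trajectory(
    grid: Grid,
    waypoints: list[Point],
    goal: Point,
    goal_radius: float = 0.5,  # assumed tolerance, not from the paper
    step: float = 0.1,         # assumed sampling resolution
) -> TrajectoryCheck:
    """Sample each segment at a fixed step and test every sample point."""
    if not waypoints:
        return TrajectoryCheck(collision_free=False, reaches_goal=False)

    collision_free = all(is_free(grid, p) for p in waypoints)
    for a, b in zip(waypoints, waypoints[1:]):
        length = hypot(b[0] - a[0], b[1] - a[1])
        n = max(1, int(length / step))
        for i in range(n + 1):
            t = i / n
            p = (a[0] + t * (b[0] - a[0]), a[1] + t * (b[1] - a[1]))
            if not is_free(grid, p):
                collision_free = False

    end = waypoints[-1]
    reaches_goal = hypot(end[0] - goal[0], end[1] - goal[1]) <= goal_radius
    return TrajectoryCheck(collision_free, reaches_goal)


# Toy scene with a single obstacle in the middle cell.
scene = [
    [0, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]
goal = (2.5, 2.5)
straight = check_trajectory(scene, [(0.5, 0.5), (2.5, 2.5)], goal)
detour = check_trajectory(scene, [(0.5, 0.5), (2.5, 0.5), (2.5, 2.5)], goal)
```

In this toy scene the straight path cuts through the obstacle and fails the collision check even though it ends at the goal, while the detour satisfies both conditions. A geometric check like this does not capture whether the end point is compatible with the instructed action, which is the part the benchmark's judge-based evaluation addresses.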

Problem

Research questions and friction points this paper is trying to address.

Vision-Language Navigation

Instruction Grounding

3D Environments

Embodied Reasoning

Spatial Coherence

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Navigation

Embodied Reasoning

3D Environment Benchmarking