BEACON: Language-Conditioned Navigation Affordance Prediction under Occlusion

📅 2026-03-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of language-guided local navigation under target occlusion, where existing methods struggle to accurately infer traversable regions. To overcome this limitation, the paper proposes the first approach that models language-conditioned navigation in an egocentric bird’s-eye-view (BEV) space. By fusing multi-view RGB-D observations with natural language instructions, the method predicts a traversability heatmap that explicitly accounts for occluded areas. A novel spatial prompt injection mechanism is introduced to integrate geometric depth cues with outputs from a vision-language model (VLM). Evaluated in the Habitat simulation environment, the proposed method improves accuracy, averaged across geodesic distance thresholds, by 22.74 percentage points over state-of-the-art image-space approaches on a validation subset containing occluded targets.
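The reported metric — accuracy averaged across several geodesic distance thresholds — can be sketched as below. This is an illustrative reading of the metric only; the specific threshold values are assumptions, not taken from the paper.

```python
def threshold_averaged_accuracy(geodesic_errors, thresholds=(0.25, 0.5, 1.0)):
    """Accuracy averaged across geodesic distance thresholds.

    geodesic_errors: geodesic distance (meters) between each predicted
    target location and the ground truth. For each threshold, accuracy is
    the fraction of predictions whose error is within that threshold; the
    final score averages these per-threshold accuracies.
    Threshold values here are illustrative, not the paper's.
    """
    n = len(geodesic_errors)
    per_threshold = [sum(e <= t for e in geodesic_errors) / n
                     for t in thresholds]
    return sum(per_threshold) / len(thresholds)
```

Under this reading, a "22.74 percentage point improvement" means the averaged score rises by 0.2274 relative to the image-space baseline.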

📝 Abstract
Language-conditioned local navigation requires a robot to infer a nearby traversable target location from its current observation and an open-vocabulary, relational instruction. Existing vision-language spatial grounding methods usually rely on vision-language models (VLMs) to reason in image space, producing 2D predictions tied to visible pixels. As a result, they struggle to infer target locations in occluded regions, typically caused by furniture or moving humans. To address this issue, we propose BEACON, which predicts an ego-centric Bird's-Eye View (BEV) affordance heatmap over a bounded local region including occluded areas. Given an instruction and surround-view RGB-D observations from four directions around the robot, BEACON predicts the BEV heatmap by injecting spatial cues into a VLM and fusing the VLM's output with depth-derived BEV features. Using an occlusion-aware dataset built in the Habitat simulator, we conduct detailed experimental analysis to validate both our BEV space formulation and the design choices of each module. Our method improves the accuracy averaged across geodesic thresholds by 22.74 percentage points over the state-of-the-art image-space baseline on the validation subset with occluded target locations. Our project page is: https://xin-yu-gao.github.io/beacon.
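The depth-derived BEV features mentioned in the abstract can be illustrated with a minimal sketch: unproject each of the four depth views into a shared ego-centric ground-plane grid using a pinhole camera model and the view's yaw. This is a simplified assumption-laden sketch (function name, grid parameters, and occupancy binning are hypothetical), not the paper's implementation.

```python
import numpy as np

def depth_to_bev_occupancy(depth, K, yaw, grid_size=64, cell=0.25):
    """Unproject one depth view into an ego-centric BEV occupancy grid.

    depth: (H, W) depth in meters; K: 3x3 pinhole intrinsics;
    yaw: camera heading in radians (0 = robot forward).
    Hypothetical helper, not the paper's code.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth.ravel()
    valid = z > 0
    # Back-project pixels to camera-frame lateral offset x (z is forward).
    x = (u.ravel() - K[0, 2]) * z / K[0, 0]
    # Rotate by the view's yaw so all four views share one ego frame;
    # height is ignored in this ground-plane sketch.
    gx = np.cos(yaw) * x - np.sin(yaw) * z
    gz = np.sin(yaw) * x + np.cos(yaw) * z
    # Center the robot in the grid and bin points into cells.
    i = (gx[valid] / cell + grid_size / 2).astype(int)
    j = (gz[valid] / cell + grid_size / 2).astype(int)
    bev = np.zeros((grid_size, grid_size), dtype=np.float32)
    inside = (i >= 0) & (i < grid_size) & (j >= 0) & (j < grid_size)
    bev[j[inside], i[inside]] = 1.0
    return bev
```

In the paper, grids like this would be fused with VLM outputs to predict the affordance heatmap; the sketch covers only the geometric depth-to-BEV step.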
Problem

Research questions and friction points this paper is trying to address.

language-conditioned navigation
occlusion
spatial grounding
affordance prediction
vision-language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bird's-Eye View (BEV)
vision-language navigation
occlusion-aware affordance prediction
spatial grounding
ego-centric navigation