AI Summary
Existing language-guided robotic systems predominantly rely on discrete point representations for spatial targets, leaving them vulnerable to perceptual noise and semantic ambiguity and thereby compromising robustness and interpretability. To address this, we propose RoboMAP, a novel framework that introduces adaptive affordance heatmaps as continuous, probabilistic representations of spatial targets, explicitly modeling uncertainty in spatial grounding. RoboMAP integrates vision-language models with nonparametric probability density estimation, enabling dense spatial reasoning, efficient integration with downstream policies, and zero-shot transfer across tasks. On mainstream grounding benchmarks, RoboMAP achieves state-of-the-art performance across all metrics while accelerating inference by up to 50×. In real-world manipulation tasks it attains an 82% success rate, and in navigation tasks it demonstrates strong zero-shot generalization without task-specific fine-tuning.
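To make the representational shift concrete: instead of committing to a single target point, a continuous affordance heatmap spreads probability mass over the image according to candidate confidence. The sketch below is illustrative only, not RoboMAP's implementation; the candidate points, weights, and Gaussian bandwidth are hypothetical stand-ins for what a vision-language model and a nonparametric density estimator (here, a simple Gaussian kernel density estimate) might produce.

```python
import numpy as np

def affordance_heatmap(points, weights, shape, bandwidth=8.0):
    """Render weighted candidate target points as a continuous probability map.

    points    -- (N, 2) array of (x, y) pixel coordinates (hypothetical VLM outputs)
    weights   -- (N,) confidence for each candidate
    shape     -- (H, W) of the output heatmap
    bandwidth -- Gaussian kernel std in pixels; larger values encode more uncertainty
    """
    H, W = shape
    ys, xs = np.mgrid[0:H, 0:W]  # pixel coordinate grids
    heat = np.zeros((H, W), dtype=np.float64)
    for (px, py), w in zip(points, weights):
        # Accumulate an isotropic Gaussian kernel centered on each candidate.
        heat += w * np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2 * bandwidth ** 2))
    heat /= heat.sum()  # normalize so the map sums to 1 (a probability map)
    return heat

# Two hypothetical candidate grasp points, one more confident than the other.
hm = affordance_heatmap(np.array([[20.0, 30.0], [60.0, 40.0]]),
                        np.array([0.8, 0.2]), (64, 96))
# A downstream policy can consume the dense map directly, or take its mode:
target_yx = np.unravel_index(hm.argmax(), hm.shape)
```

A point-based system would forward only `target_yx`; passing the full `hm` instead lets a downstream policy see how peaked or ambiguous the grounding is, which is the uncertainty information the dense representation is meant to preserve.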
Abstract
Many language-guided robotic systems rely on collapsing spatial reasoning into discrete points, making them brittle to perceptual noise and semantic ambiguity. To address this challenge, we propose RoboMAP, a framework that represents spatial targets as continuous, adaptive affordance heatmaps. This dense representation captures the uncertainty in spatial grounding and provides richer information for downstream policies, thereby significantly enhancing task success and interpretability. RoboMAP surpasses the previous state-of-the-art on a majority of grounding benchmarks with up to a 50× speed improvement, and achieves an 82% success rate in real-world manipulation. Across extensive simulated and physical experiments, it demonstrates robust performance and shows strong zero-shot generalization to navigation. More details and videos can be found at https://robo-map.github.io.