SocialNav-SUB: Benchmarking VLMs for Scene Understanding in Social Robot Navigation

📅 2025-09-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: It remains unclear whether vision-language models (VLMs) understand social navigation scenes well enough to support socially aware robot navigation in dynamic human environments. Method: We introduce SocialNav-SUB, the first benchmark for social navigation scene understanding, comprising spatial, spatiotemporal, and social reasoning visual question-answering (VQA) tasks. It is built on a curated dataset of real-world human interaction videos and systematically evaluates VLMs' ability to infer agent relationships and recognize human intentions. Contribution/Results: We propose a dual-baseline evaluation framework that compares VLMs against both human consensus and rule-based methods, revealing that state-of-the-art VLMs, while approaching human agreement on certain tasks, significantly underperform both baselines on critical social reasoning dimensions. The experiments expose fundamental gaps in the contextual understanding required for safe, socially compliant navigation. SocialNav-SUB establishes a reproducible evaluation paradigm and identifies concrete directions for future work.

📝 Abstract
Robot navigation in dynamic, human-centered environments requires socially compliant decisions grounded in robust scene understanding. Recent Vision-Language Models (VLMs) exhibit promising capabilities such as object recognition, common-sense reasoning, and contextual understanding, capabilities that align with the nuanced requirements of social robot navigation. However, it remains unclear whether VLMs can accurately understand complex social navigation scenes (e.g., inferring the spatial-temporal relations among agents and human intentions), which is essential for safe and socially compliant robot navigation. While some recent works have explored the use of VLMs in social robot navigation, no existing work systematically evaluates their ability to meet these necessary conditions. In this paper, we introduce the Social Navigation Scene Understanding Benchmark (SocialNav-SUB), a Visual Question Answering (VQA) dataset and benchmark designed to evaluate VLMs for scene understanding in real-world social robot navigation scenarios. SocialNav-SUB provides a unified framework for evaluating VLMs against human and rule-based baselines across VQA tasks requiring spatial, spatiotemporal, and social reasoning in social robot navigation. Through experiments with state-of-the-art VLMs, we find that while the best-performing VLM achieves an encouraging probability of agreeing with human answers, it still underperforms both a simpler rule-based approach and human consensus baselines, indicating critical gaps in the social scene understanding of current VLMs. Our benchmark sets the stage for further research on foundation models for social robot navigation, offering a framework to explore how VLMs can be tailored to meet real-world social robot navigation needs. An overview of this paper, along with the code and data, can be found at https://larg.github.io/socialnav-sub.
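
The headline metric is the probability that a model's answer agrees with human annotators. Below is a minimal sketch of how such a metric and the human-consensus baseline might be computed, assuming several human answers per question and uniform sampling over annotators; the paper's exact formulation may differ.

```python
from collections import Counter

def prob_agree(answer: str, human_answers: list[str]) -> float:
    """Probability that `answer` matches a uniformly sampled
    human annotator's answer for a single VQA question."""
    return Counter(human_answers)[answer] / len(human_answers)

def consensus_answer(human_answers: list[str]) -> str:
    """Majority-vote answer; the human-consensus baseline
    answers every question this way."""
    return Counter(human_answers).most_common(1)[0][0]

def benchmark_score(model_answers: dict[str, str],
                    annotations: dict[str, list[str]]) -> float:
    """Mean agreement probability across all benchmark questions."""
    scores = [prob_agree(model_answers[qid], humans)
              for qid, humans in annotations.items()]
    return sum(scores) / len(scores)

# Hypothetical example: three annotators answered one question.
annotations = {"q1": ["yes", "yes", "no"]}
print(benchmark_score({"q1": "yes"}, annotations))  # 0.667
print(benchmark_score({"q1": consensus_answer(annotations["q1"])}, annotations))
```

Under this formulation, the human-consensus baseline is an upper bound on per-question agreement, which is consistent with the abstract's finding that the best VLM still falls short of it.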
Problem

Research questions and friction points this paper is trying to address.

Evaluating VLMs' scene understanding in social robot navigation
Assessing spatiotemporal and social reasoning capabilities of VLMs
Benchmarking VLM performance against human and rule-based baselines
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces SocialNav-SUB benchmark for VLM evaluation
Uses a VQA dataset covering spatial, spatiotemporal, and social reasoning
Compares VLMs against human and rule-based baselines (see the sketch below)
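
To make the dual-baseline comparison concrete, here is an illustrative sketch of a simple rule-based baseline answering a spatial question from tracked agent positions, scored with the same agreement metric (reusing `benchmark_score` and `prob_agree` from the sketch above). All question types, thresholds, and coordinate conventions here are hypothetical; the paper's actual baseline rules are not described in this summary.

```python
import math

def rule_based_answer(question_type: str,
                      robot_xy: tuple[float, float],
                      person_xy: tuple[float, float]) -> str:
    """Hypothetical geometric rules answering simple spatial VQA
    questions directly from tracked 2D positions (illustrative only)."""
    dx = person_xy[0] - robot_xy[0]
    dy = person_xy[1] - robot_xy[1]
    if question_type == "is_person_nearby":
        # Assumed proximity threshold of 2 meters.
        return "yes" if math.hypot(dx, dy) < 2.0 else "no"
    if question_type == "person_side":
        # Assumes a robot-centric frame with +y pointing left.
        return "left" if dy > 0 else "right"
    raise ValueError(f"unhandled question type: {question_type}")

# Hypothetical comparison on one question, reusing benchmark_score above.
humans = {"q1": ["yes", "yes", "no"]}  # three annotator answers
vlm = {"q1": "no"}                     # a VLM's answer
rules = {"q1": rule_based_answer("is_person_nearby", (0.0, 0.0), (1.0, 0.5))}
print(benchmark_score(vlm, humans), benchmark_score(rules, humans))  # 0.333 0.667
```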
Michael J. Munje
Department of Computer Science, The University of Texas at Austin
Chen Tang
Department of Computer Science, The University of Texas at Austin
Shuijing Liu
Postdoc, The University of Texas at Austin
Robot Learning, Human Robot Interaction
Zichao Hu
Department of Computer Science, The University of Texas at Austin
Yifeng Zhu
Department of Computer Science, The University of Texas at Austin
Jiaxun Cui
The University of Texas at Austin
Reinforcement Learning, Multi-agent Learning, Game Theory
Garrett Warnell
Research Scientist, Army Research Laboratory
Machine Learning, Robotics, Artificial Intelligence
Joydeep Biswas
Associate Professor, Computer Science Department, The University of Texas at Austin
Robotics, Artificial Intelligence, Multi Robot Systems, Localization, Mapping
Peter Stone
Department of Computer Science, The University of Texas at Austin, Sony AI