HRIBench: Benchmarking Vision-Language Models for Real-Time Human Perception in Human-Robot Interaction

📅 2025-06-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing large vision-language models (VLMs) face dual bottlenecks in human-robot interaction (HRI): high latency in real-time human perception and insufficient perceptual capability. To address this, we introduce HRIBench, a multi-dimensional visual question answering (VQA) benchmark explicitly designed for real-time perception in HRI. It encompasses five core tasks, non-verbal cue understanding, verbal instruction understanding, human-robot object relationship understanding, social navigation, and person identification, and integrates both newly collected real-world HRI data and four established public datasets. Using HRIBench, we systematically evaluate 11 state-of-the-art closed- and open-source VLMs, revealing consistently limited performance on fundamental perception tasks and a widespread failure to meet real-time latency constraints. This work establishes an HRI-oriented evaluation paradigm that jointly assesses accuracy and inference latency, providing a critical benchmark and empirical foundation for developing lightweight, low-latency VLMs tailored to interactive robotics.

📝 Abstract
Real-time human perception is crucial for effective human-robot interaction (HRI). Large vision-language models (VLMs) offer promising generalizable perceptual capabilities but often suffer from high latency, which negatively impacts user experience and limits VLM applicability in real-world scenarios. To systematically study VLM capabilities in human perception for HRI and performance-latency trade-offs, we introduce HRIBench, a visual question-answering (VQA) benchmark designed to evaluate VLMs across a diverse set of human perceptual tasks critical for HRI. HRIBench covers five key domains: (1) non-verbal cue understanding, (2) verbal instruction understanding, (3) human-robot object relationship understanding, (4) social navigation, and (5) person identification. To construct HRIBench, we collected data from real-world HRI environments to curate questions for non-verbal cue understanding, and leveraged publicly available datasets for the remaining four domains. We curated 200 VQA questions for each domain, resulting in a total of 1000 questions for HRIBench. We then conducted a comprehensive evaluation of both state-of-the-art closed-source and open-source VLMs (N=11) on HRIBench. Our results show that, despite their generalizability, current VLMs still struggle with core perceptual capabilities essential for HRI. Moreover, none of the models within our experiments demonstrated a satisfactory performance-latency trade-off suitable for real-time deployment, underscoring the need for future research on developing smaller, low-latency VLMs with improved human perception capabilities. HRIBench and our results can be found in this Github repository: https://github.com/interaction-lab/HRIBench.
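The abstract's evaluation protocol, scoring each VLM on per-domain VQA accuracy while jointly measuring inference latency, can be sketched roughly as follows. This is an illustrative assumption, not the authors' actual harness: the `stub_vlm` function stands in for a real model call, and the data items are toy placeholders.

```python
import time

# The five HRIBench domains named in the abstract.
DOMAINS = [
    "non-verbal cue understanding",
    "verbal instruction understanding",
    "human-robot object relationship understanding",
    "social navigation",
    "person identification",
]

def stub_vlm(image, question):
    """Placeholder for a real VLM call (e.g., an API or local checkpoint)."""
    return "A"

def evaluate(model, questions):
    """Return (accuracy, mean latency in seconds) over a list of VQA items.

    Each item is a dict with "image", "question", and "answer" keys.
    """
    correct = 0
    total_latency = 0.0
    for item in questions:
        start = time.perf_counter()          # time each inference call
        prediction = model(item["image"], item["question"])
        total_latency += time.perf_counter() - start
        correct += prediction == item["answer"]
    n = len(questions)
    return correct / n, total_latency / n

# Toy multiple-choice items; the real benchmark uses 200 questions per domain.
toy = [
    {"image": None, "question": "Is the person waving?", "answer": "A"},
    {"image": None, "question": "Which object is requested?", "answer": "B"},
]
acc, latency = evaluate(stub_vlm, toy)
print(f"accuracy={acc:.2f}, mean latency={latency * 1000:.2f} ms")
```

Aggregating such (accuracy, latency) pairs per model is what allows the performance-latency trade-off comparison the paper reports across its 11 VLMs.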
Problem

Research questions and friction points this paper is trying to address.

Evaluating VLMs for real-time human perception in HRI
Assessing performance-latency trade-offs in vision-language models
Benchmarking VLMs across key human-robot interaction domains
Innovation

Methods, ideas, or system contributions that make the work stand out.

HRIBench benchmark for VLM evaluation
Real-time human perception in HRI
Performance-latency trade-off analysis
Authors

Zhonghao Shi
Thomas Lord Department of Computer Science, Viterbi School of Engineering, University of Southern California, Los Angeles CA 90089, USA

Enyu Zhao
Thomas Lord Department of Computer Science, Viterbi School of Engineering, University of Southern California, Los Angeles CA 90089, USA

Nathaniel Dennler
Postdoctoral researcher, Massachusetts Institute of Technology
Interests: Human-Robot Interaction, Assistive Robotics, Preference Learning, Personalization, Customization

Jingzhen Wang
University of Southern California
Interests: Machine Learning, Imitation Learning

Xinyang Xu
Thomas Lord Department of Computer Science, Viterbi School of Engineering, University of Southern California, Los Angeles CA 90089, USA

Kaleen Shrestha
Thomas Lord Department of Computer Science, Viterbi School of Engineering, University of Southern California, Los Angeles CA 90089, USA

Mengxue Fu
University of Southern California
Interests: Artificial Intelligence, Robotics

Daniel Seita
University of Southern California
Interests: Robotics, Machine Learning

Maja Matarić
Thomas Lord Department of Computer Science, Viterbi School of Engineering, University of Southern California, Los Angeles CA 90089, USA