HRIBench: Benchmarking Vision-Language Models for Real-Time Human Perception in Human-Robot Interaction

📅 2025-06-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing large vision-language models (VLMs) face dual bottlenecks in human-robot interaction (HRI): high latency in real-time human perception and insufficient perceptual capability. To address this, we introduce HRIBench, a multi-dimensional visual question answering (VQA) benchmark explicitly designed for real-time perception in HRI. It encompasses five core tasks, non-verbal cue understanding, verbal instruction understanding, human-robot object relationship understanding, social navigation, and person identification, and integrates both newly collected real-world HRI data and four established public datasets. Using HRIBench, we systematically evaluate 11 state-of-the-art closed- and open-source VLMs, revealing consistently limited performance on fundamental perception tasks and a widespread failure to meet real-time latency constraints. This work establishes an HRI-oriented evaluation paradigm that jointly assesses accuracy and inference latency, providing a critical benchmark and empirical foundation for developing lightweight, low-latency VLMs tailored to interactive robotics.

📝 Abstract
Real-time human perception is crucial for effective human-robot interaction (HRI). Large vision-language models (VLMs) offer promising generalizable perceptual capabilities but often suffer from high latency, which negatively impacts user experience and limits VLM applicability in real-world scenarios. To systematically study VLM capabilities in human perception for HRI and performance-latency trade-offs, we introduce HRIBench, a visual question-answering (VQA) benchmark designed to evaluate VLMs across a diverse set of human perceptual tasks critical for HRI. HRIBench covers five key domains: (1) non-verbal cue understanding, (2) verbal instruction understanding, (3) human-robot object relationship understanding, (4) social navigation, and (5) person identification. To construct HRIBench, we collected data from real-world HRI environments to curate questions for non-verbal cue understanding, and leveraged publicly available datasets for the remaining four domains. We curated 200 VQA questions for each domain, resulting in a total of 1000 questions for HRIBench. We then conducted a comprehensive evaluation of both state-of-the-art closed-source and open-source VLMs (N=11) on HRIBench. Our results show that, despite their generalizability, current VLMs still struggle with core perceptual capabilities essential for HRI. Moreover, none of the models within our experiments demonstrated a satisfactory performance-latency trade-off suitable for real-time deployment, underscoring the need for future research on developing smaller, low-latency VLMs with improved human perception capabilities. HRIBench and our results can be found in this Github repository: https://github.com/interaction-lab/HRIBench.
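The abstract's evaluation protocol, scoring each VLM on per-domain VQA accuracy while jointly measuring inference latency, can be sketched roughly as follows. This is an illustrative assumption, not the authors' actual harness: the `stub_vlm` function stands in for a real model call, and the data items are toy placeholders.

```python
import time

# The five HRIBench domains named in the abstract.
DOMAINS = [
    "non-verbal cue understanding",
    "verbal instruction understanding",
    "human-robot object relationship understanding",
    "social navigation",
    "person identification",
]

def stub_vlm(image, question):
    """Placeholder for a real VLM call (e.g., an API or local checkpoint)."""
    return "A"

def evaluate(model, questions):
    """Return (accuracy, mean latency in seconds) over a list of VQA items.

    Each item is a dict with "image", "question", and "answer" keys.
    """
    correct = 0
    total_latency = 0.0
    for item in questions:
        start = time.perf_counter()          # time each inference call
        prediction = model(item["image"], item["question"])
        total_latency += time.perf_counter() - start
        correct += prediction == item["answer"]
    n = len(questions)
    return correct / n, total_latency / n

# Toy multiple-choice items; the real benchmark uses 200 questions per domain.
toy = [
    {"image": None, "question": "Is the person waving?", "answer": "A"},
    {"image": None, "question": "Which object is requested?", "answer": "B"},
]
acc, latency = evaluate(stub_vlm, toy)
print(f"accuracy={acc:.2f}, mean latency={latency * 1000:.2f} ms")
```

Aggregating such (accuracy, latency) pairs per model is what allows the performance-latency trade-off comparison the paper reports across its 11 VLMs.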
Problem

Research questions and friction points this paper is trying to address.

Evaluating VLMs for real-time human perception in HRI
Assessing performance-latency trade-offs in vision-language models
Benchmarking VLMs across key human-robot interaction domains
Innovation

Methods, ideas, or system contributions that make the work stand out.

HRIBench benchmark for VLM evaluation
Real-time human perception in HRI
Performance-latency trade-off analysis
Authors

Zhonghao Shi
Thomas Lord Department of Computer Science, Viterbi School of Engineering, University of Southern California, Los Angeles CA 90089, USA

Enyu Zhao
Thomas Lord Department of Computer Science, Viterbi School of Engineering, University of Southern California, Los Angeles CA 90089, USA

Nathaniel Dennler
Postdoctoral researcher, Massachusetts Institute of Technology
Interests: Human-Robot Interaction, Assistive Robotics, Preference Learning, Personalization, Customization

Jingzhen Wang
University of Southern California
Interests: Machine Learning, Imitation Learning

Xinyang Xu
Thomas Lord Department of Computer Science, Viterbi School of Engineering, University of Southern California, Los Angeles CA 90089, USA

Kaleen Shrestha
Thomas Lord Department of Computer Science, Viterbi School of Engineering, University of Southern California, Los Angeles CA 90089, USA

Mengxue Fu
University of Southern California
Interests: Artificial Intelligence, Robotics

Daniel Seita
University of Southern California
Interests: Robotics, Machine Learning

Maja Matarić
Thomas Lord Department of Computer Science, Viterbi School of Engineering, University of Southern California, Los Angeles CA 90089, USA