Using Vision-Language Models as Proxies for Social Intelligence in Human-Robot Interaction

📅 2025-12-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Robots must discern the appropriate timing for human–robot interaction in dynamic everyday environments, yet existing approaches struggle to model subtle nonverbal cues that evolve over time. This paper proposes a two-stage social-timing decision framework: first, a lightweight gaze-shift and spatial-relation detector identifies critical social trigger signals in real time; second, a vision-language model (VLM) is invoked on the surrounding temporal video context, guided by customized prompt engineering, to interpret nonverbal cues at a fine grain. Crucially, the VLM functions as a socially intelligent agent, activated only at key decision points, balancing computational efficiency against semantic depth. Evaluated on replayed real-world café scenarios, the method improves the accuracy and naturalness of robotic social responses. The results support selective VLM invocation for embodied social intelligence, indicating that targeted, context-aware use of large models outperforms continuous or static alternatives in timing-critical interactive settings.
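
The paper's code is not reproduced here, but the selective-invocation idea is straightforward to sketch. Below is a minimal, hypothetical Python sketch of the two-stage loop: a cheap per-frame detector watches for gaze shifts and proxemic changes, and only a trigger causes a buffered video clip to be sent to a VLM. All thresholds, names, and the `query_vlm` stub are assumptions for illustration, not the authors' actual code or prompts.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class PersonState:
    gaze_yaw: float    # gaze/head yaw in radians, from a lightweight tracker
    distance_m: float  # distance to the robot, from depth or leg detection

GAZE_SHIFT_RAD = 0.35  # assumed threshold: ~20 deg gaze rotation counts as a shift
SOCIAL_ZONE_M = 2.0    # assumed proxemic boundary of the robot's social space
CLIP_LEN = 30          # frames of temporal context forwarded to the VLM

def is_trigger(prev: PersonState, curr: PersonState) -> bool:
    """Stage 1: cheap per-frame check for a socially meaningful moment."""
    gaze_shift = abs(curr.gaze_yaw - prev.gaze_yaw) > GAZE_SHIFT_RAD
    entered_zone = prev.distance_m > SOCIAL_ZONE_M >= curr.distance_m
    return gaze_shift or entered_zone

def query_vlm(frames, prompt):
    """Stage 2 placeholder: stand-in for a video-capable VLM call.
    The paper does not specify a backend, so this stub returns a
    canned answer; a real multimodal API would be swapped in here."""
    return "engage" if frames else "wait"

def run_pipeline(stream):
    """stream yields (frame, PersonState) pairs; the VLM is queried
    only when the lightweight detector fires."""
    buffer = deque(maxlen=CLIP_LEN)  # rolling temporal video context
    prev = None
    for frame, state in stream:
        buffer.append(frame)
        if prev is not None and is_trigger(prev, state):
            decision = query_vlm(
                list(buffer),
                "Is this person ready to interact? Answer engage or wait.",
            )
            print("VLM decision:", decision)
        prev = state
```

The gating is what keeps the approach tractable: the heavy model never runs per-frame, only when a trigger fires, which matches the paper's framing of the VLM as an agent consulted at key decision points.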

📝 Abstract
Robots operating in everyday environments must often decide when and whether to engage with people, yet such decisions hinge on subtle nonverbal cues that unfold over time and are difficult to model explicitly. Drawing on a five-day Wizard-of-Oz deployment of a mobile service robot in a university café, we analyze how people signal interaction readiness through nonverbal behaviors and how expert wizards use these cues to guide engagement. Motivated by these observations, we propose a two-stage pipeline in which lightweight perceptual detectors (gaze shifts and proxemics) selectively trigger heavier video-based vision-language model (VLM) queries at socially meaningful moments. We evaluate this pipeline on replayed field interactions and compare two prompting strategies. Our findings suggest that selectively using VLMs as proxies for social reasoning enables socially responsive robot behavior, allowing robots to act appropriately by attending to the cues people naturally provide in real-world interactions.
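
The abstract mentions comparing two prompting strategies without naming them. As a rough illustration only, a direct yes/no prompt and a describe-then-decide prompt are two plausible variants; the templates and the `build_query` helper below are assumptions, not the paper's prompts or interface.

```python
# Both templates are illustrative guesses; the abstract does not name
# the two prompting strategies that were compared.

DIRECT_PROMPT = (
    "You see a short clip of a person near a service robot in a café. "
    "Should the robot initiate interaction now? Answer only 'engage' or 'wait'."
)

DESCRIBE_THEN_DECIDE_PROMPT = (
    "You see a short clip of a person near a service robot in a café. "
    "First describe the person's gaze, body orientation, and distance to the "
    "robot. Then, based on those cues, end with a single line: 'engage' or 'wait'."
)

def build_query(frames, strategy="direct"):
    """Package a clip and a prompt for whichever VLM backend is in use
    (hypothetical helper; the paper's interface is not public)."""
    prompt = DIRECT_PROMPT if strategy == "direct" else DESCRIBE_THEN_DECIDE_PROMPT
    return {"frames": frames, "prompt": prompt}
```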
Problem

Research questions and friction points this paper addresses.

Robots detect human nonverbal cues for engagement decisions
Vision-language models act as proxies for social intelligence
Selective VLM queries enable socially responsive robot behavior
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-language models for social reasoning
Two-stage pipeline with lightweight detectors
Selective VLM queries at key moments