SocialGesture: Delving into Multi-person Gesture Understanding

📅 2025-04-03
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Existing gesture recognition research predominantly focuses on single-user scenarios and neglects natural multi-person social interactions, which severely limits joint modeling of gestures, language, and social intent. Method: We introduce SocialGesture, the first large-scale dataset for multi-person interactive gesture understanding. It covers diverse naturalistic settings and supports video-level recognition, temporal localization, and a novel Social Gesture Visual Question Answering (VQA) task. We systematically formalize multi-person gesture understanding, establish an end-to-end multimodal annotation and evaluation framework, and release it openly via Hugging Face. Results: Experiments reveal substantial performance degradation of state-of-the-art gesture models on group interaction tasks, and current vision-language models (VLMs) exhibit pervasive deficits in social intent awareness. This work provides a reproducible, extensible foundation, both a data resource and a benchmark, for unified modeling of gestures, language, and social context.

📝 Abstract
Previous research in human gesture recognition has largely overlooked multi-person interactions, which are crucial for understanding the social context of naturally occurring gestures. This limitation in existing datasets presents a significant challenge in aligning human gestures with other modalities such as language and speech. To address this issue, we introduce SocialGesture, the first large-scale dataset specifically designed for multi-person gesture analysis. SocialGesture features a diverse range of natural scenarios and supports multiple gesture analysis tasks, including video-based recognition and temporal localization, providing a valuable resource for advancing the study of gestures during complex social interactions. Furthermore, we propose a novel visual question answering (VQA) task to benchmark vision-language models' (VLMs) performance on social gesture understanding. Our findings highlight several limitations of current gesture recognition models, offering insights into future directions for improvement in this field. SocialGesture is available at huggingface.co/datasets/IrohXu/SocialGesture.
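
For readers who want to explore the data, a minimal loading sketch follows. It assumes the dataset can be pulled with the Hugging Face `datasets` library; the split name and the way fields are inspected are illustrative assumptions, not the dataset's documented schema, so consult the dataset card for the actual layout.

```python
# Minimal sketch: loading SocialGesture from Hugging Face.
# NOTE: the "train" split and any field names printed below are assumptions
# for illustration; see huggingface.co/datasets/IrohXu/SocialGesture for
# the actual schema.
from datasets import load_dataset

ds = load_dataset("IrohXu/SocialGesture")  # downloads all available splits
print(ds)  # inspect the split names and features the dataset actually exposes

# Assuming a "train" split exists, peek at one example to see its fields.
example = ds["train"][0]
for key, value in example.items():
    print(key, type(value))
```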
Problem

Research questions and friction points this paper is trying to address.

Overcoming the single-person bias of existing gesture recognition datasets
Aligning gestures with other modalities such as language and speech
Advancing gesture analysis in naturally occurring social interactions
Innovation

Methods, ideas, or system contributions that make the work stand out.

First large-scale dataset for multi-person gesture analysis
Supports video-based recognition and temporal localization
Introduces a VQA task for social gesture understanding (see the sketch below)
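
To make the VQA benchmark concrete, the sketch below shows one way a VLM could be scored on social-gesture questions. The model interface, the question format, and exact-match scoring are all hypothetical stand-ins; the paper's actual evaluation protocol is not reproduced here.

```python
# Hypothetical sketch of benchmarking a VLM on Social Gesture VQA.
# The (video_path, question) -> answer model interface and the exact-match
# metric are illustrative assumptions, not the paper's actual protocol.
from typing import Callable, Dict, List


def evaluate_social_gesture_vqa(
    model: Callable[[str, str], str],   # (video_path, question) -> answer
    samples: List[Dict[str, str]],      # each: {"video", "question", "answer"}
) -> float:
    """Return exact-match accuracy of a VLM on gesture VQA samples."""
    correct = 0
    for sample in samples:
        prediction = model(sample["video"], sample["question"])
        if prediction.strip().lower() == sample["answer"].strip().lower():
            correct += 1
    return correct / len(samples) if samples else 0.0


# Usage with a trivial stub standing in for a real VLM call:
def stub_model(video_path: str, question: str) -> str:
    return "pointing"  # a real model would analyze the video clip here

samples = [{
    "video": "clip_001.mp4",
    "question": "Who is the target of the speaker's pointing gesture?",
    "answer": "the person on the left",
}]
print(evaluate_social_gesture_vqa(stub_model, samples))  # 0.0
```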
👥 Authors
Xu Cao, University of Illinois Urbana-Champaign
Pranav Virupaksha, University of Illinois Urbana-Champaign
Wenqi Jia, University of Illinois Urbana-Champaign
Bolin Lai, Georgia Institute of Technology
Fiona Ryan, Georgia Institute of Technology
Sangmin Lee, Sungkyunkwan University
J. Rehg, University of Illinois Urbana-Champaign