GRASP: Learning to Ground Social Reasoning in Multi-Person Non-Verbal Interactions

📅 2026-05-15
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Existing multimodal large language models struggle to identify who interacts with whom in non-verbal social interactions in multi-person videos. This work introduces GRASP, a large-scale social reasoning dataset that builds questions from identity-consistent gaze trajectories, pointing (deictic) gestures, and their joint compositions into fine-grained social events, along with GRASP-Bench, a dedicated evaluation benchmark. The authors further propose Social Grounding Reward (SGR), a reinforcement learning signal that uses these social events to guide models in reasoning jointly over low-level behavioral cues and the participants involved in each interaction. SGR improves performance on GRASP-Bench while maintaining zero-shot performance on related social video question-answering benchmarks.
📝 Abstract
Understanding social interactions requires reasoning over subtle non-verbal cues, yet current multimodal large language models (MLLMs) often fail to identify who interacts with whom in multi-person videos. We introduce GRASP, a large-scale social reasoning dataset that connects high-level social QA with fine-grained gaze and deictic gesture events. GRASP contains 290K question-answer pairs over 46K videos totaling 749 hours, organized by a 16-category taxonomy spanning gaze, gesture, and joint gaze–gesture reasoning, together with GRASP-Bench for evaluation. Unlike prior resources that focus on either isolated cues or high-level social QA, GRASP builds questions from identity-consistent gaze trajectories, deictic gestures, and their joint compositions into social events. Moreover, we propose Social Grounding Reward (SGR), a learning signal that uses these social events to encourage models to reason about the participants involved in each interaction. Experiments show that SGR improves performance on GRASP-Bench while maintaining zero-shot performance on related social video QA benchmarks.
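The abstract does not say how SGR is computed. As a purely illustrative sketch of the general idea (rewarding a model both for answering correctly and for naming the right participants), one could combine answer correctness with an overlap score against an annotated social event. Everything below is an assumption for illustration and not the authors' formulation: the `SocialEvent` record, the F1-style overlap, the `weight` parameter, and the function names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SocialEvent:
    """Hypothetical annotation: one participant directs a cue at another."""
    kind: str    # e.g. "gaze", "gesture", or "joint" (assumed labels)
    source: str  # ID of the participant producing the cue
    target: str  # ID of the participant the cue is directed at


def grounding_score(mentioned: set[str], event: SocialEvent) -> float:
    """F1 overlap between participants the model mentions and those in the event."""
    gold = {event.source, event.target}
    hits = len(mentioned & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(mentioned)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


def social_grounding_reward(
    answer_correct: bool,
    mentioned: set[str],
    event: SocialEvent,
    weight: float = 0.5,  # assumed trade-off between correctness and grounding
) -> float:
    """Weighted sum of answer correctness and participant grounding (illustrative only)."""
    return (1 - weight) * float(answer_correct) + weight * grounding_score(mentioned, event)


event = SocialEvent(kind="gaze", source="P1", target="P2")
print(social_grounding_reward(True, {"P1", "P2"}, event))
print(social_grounding_reward(True, {"P3"}, event))
```

Under these assumptions, a correct answer that names the wrong participants earns less reward than one that names the participants in the annotated event, which is the behavior the abstract attributes to SGR at a high level.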
Problem

Research questions and friction points this paper is trying to address.

social reasoning
non-verbal interactions
gaze
deictic gesture
multi-person videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

social grounding
gaze–gesture reasoning
multimodal large language models
Social Grounding Reward
multi-person interaction