🤖 AI Summary
This work addresses the challenges of evaluating and enhancing the fine-grained social reasoning capabilities of multimodal models in authentic social contexts. To this end, we introduce the first evidence-traceable social reasoning benchmark, comprising 272 annotated interaction videos and 1,486 human-annotated reasoning chains, which requires joint integration of visual, linguistic, and acoustic cues with external knowledge. We formally define and quantify “grounded” social reasoning for the first time, propose the first framework that explicitly incorporates external knowledge into both modeling and evaluation, and design a multidimensional evaluation metric that balances semantic correctness and structural coherence. A comprehensive assessment of state-of-the-art multimodal models reveals consistent deficiencies in evidence citation, knowledge integration, and reasoning coherence. Our benchmark provides a reproducible foundation for social intelligence research and actionable pathways for model improvement.
📝 Abstract
Social reasoning abilities are crucial for AI systems to effectively interpret and respond to multimodal human communication and interaction within social contexts. We introduce Social Genome, the first benchmark for fine-grained, grounded social reasoning abilities of multimodal models. Social Genome contains 272 videos of interactions and 1,486 human-annotated reasoning traces related to inferences about these interactions. These traces contain 5,777 reasoning steps that reference evidence from visual cues, verbal cues, vocal cues, and external knowledge (contextual knowledge external to videos). Social Genome is also the first modeling challenge to study external knowledge in social reasoning. Social Genome computes metrics to holistically evaluate semantic and structural qualities of model-generated social reasoning traces. We demonstrate the utility of Social Genome through experiments with state-of-the-art models, identifying performance gaps and opportunities for future research to improve the grounded social reasoning abilities of multimodal models.