SVBench: Evaluation of Video Generation Models on Social Reasoning

📅 2025-12-24
🤖 AI Summary
Current text-to-video generation models have improved markedly in visual fidelity and motion coherence, but they consistently lack social reasoning, that is, the ability to infer intentions, beliefs, emotions, and social norms from visual cues. To address this gap, we introduce the first benchmark dedicated to social reasoning in video generation, covering seven dimensions: mental-state inference, goal-directed action, joint attention, social coordination, prosocial behavior, social norms, and multi-agent strategy. We build the benchmark with a training-free, agent-based pipeline grounded in developmental and social psychology paradigms, which combines scenario synthesis with cue-based critique to enforce conceptual neutrality and control difficulty. Generated videos are then scored by a vision-language model (VLM) judge across five interpretable dimensions of social reasoning. A large-scale evaluation of seven state-of-the-art models reveals systematic failures in intention recognition, belief reasoning, joint attention, and prosocial inference, even though the generated videos are plausible at the surface level.

📝 Abstract
Recent text-to-video generation models exhibit remarkable progress in visual realism, motion fidelity, and text-video alignment, yet they remain fundamentally limited in their ability to generate socially coherent behavior. Unlike humans, who effortlessly infer intentions, beliefs, emotions, and social norms from brief visual cues, current models tend to render literal scenes without capturing the underlying causal or psychological logic. To systematically evaluate this gap, we introduce the first benchmark for social reasoning in video generation. Grounded in findings from developmental and social psychology, our benchmark organizes thirty classic social cognition paradigms into seven core dimensions: mental-state inference, goal-directed action, joint attention, social coordination, prosocial behavior, social norms, and multi-agent strategy. To operationalize these paradigms, we develop a fully training-free agent-based pipeline that (i) distills the reasoning mechanism of each experiment, (ii) synthesizes diverse video-ready scenarios, (iii) enforces conceptual neutrality and difficulty control through cue-based critique, and (iv) evaluates generated videos using a high-capacity VLM judge across five interpretable dimensions of social reasoning. Using this framework, we conduct the first large-scale study across seven state-of-the-art video generation systems. Our results reveal substantial performance gaps: while modern models excel in surface-level plausibility, they systematically fail in intention recognition, belief reasoning, joint attention, and prosocial inference.
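The four pipeline stages (i) to (iv) named in the abstract can be pictured as a simple chain of steps. The sketch below is illustrative only and is not the authors' implementation: `Scenario`, `critique`, `run_pipeline`, and the callables passed in are hypothetical names, and the critique step assumes that "conceptual neutrality" means the prompt should not name the concept under test, which the abstract does not spell out.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class Scenario:
    paradigm: str                 # name of the psychology paradigm being tested
    mechanism: str                # distilled reasoning mechanism (stage i)
    prompt: str                   # video-ready text prompt (stage ii)
    cues: List[str] = field(default_factory=list)  # visual cues in the scene


def critique(scenario: Scenario, leaked_terms: List[str], max_cues: int) -> Scenario:
    """Stage iii stand-in (assumption): drop words that name the tested
    concept and cap the number of cues as a crude difficulty control."""
    blocked = {t.lower() for t in leaked_terms}
    kept = [w for w in scenario.prompt.split()
            if w.lower().strip(".,") not in blocked]
    return Scenario(scenario.paradigm, scenario.mechanism,
                    " ".join(kept), scenario.cues[:max_cues])


def run_pipeline(paradigm: str,
                 distill: Callable[[str], str],
                 synthesize: Callable[[str, str], Scenario],
                 generate: Callable[[str], Any],
                 judge: Callable[[Any, Scenario], Any],
                 leaked_terms: List[str],
                 max_cues: int = 2) -> Any:
    """Chain the four stages; every callable is a placeholder for an agent,
    a video generation model, or a VLM judge."""
    mechanism = distill(paradigm)                          # (i)
    scenario = synthesize(paradigm, mechanism)             # (ii)
    scenario = critique(scenario, leaked_terms, max_cues)  # (iii)
    video = generate(scenario.prompt)                      # model under test
    return judge(video, scenario)                          # (iv)
```

Passing the stages in as callables keeps the sketch training-free in the same sense as the abstract: nothing is fitted, and each stage can be swapped for a different agent or model.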
Problem

Research questions and friction points this paper is trying to address.

Evaluates whether video generation models produce socially coherent behavior
Introduces the first benchmark for social reasoning in generated videos
Measures gaps in intention recognition, belief reasoning, joint attention, and prosocial inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free agent-based pipeline for video generation evaluation
High-capacity VLM judge for multi-dimensional social reasoning assessment
Cue-based critique to enforce neutrality and control difficulty
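A multi-dimensional judge produces one score per video and dimension, which then has to be summarized per model. The following is a minimal sketch of that bookkeeping, assuming numeric scores and simple averaging; the function name `aggregate`, the tuple layout, and the labels `model_a`, `dim_1`, and `dim_2` are placeholders, since the page does not name the five dimensions or describe the scoring scale.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def aggregate(ratings: Iterable[Tuple[str, str, float]]) -> Dict[str, Dict[str, float]]:
    """Average judge scores per (model, dimension).

    ratings: iterable of (model_name, dimension_name, score) tuples.
    Returns {model_name: {dimension_name: mean_score}}.
    """
    buckets = defaultdict(lambda: defaultdict(list))
    for model, dimension, score in ratings:
        buckets[model][dimension].append(score)
    return {model: {dim: mean(scores) for dim, scores in dims.items()}
            for model, dims in buckets.items()}


# Hypothetical ratings for one model on two placeholder dimensions.
example = aggregate([
    ("model_a", "dim_1", 1.0),
    ("model_a", "dim_1", 0.0),
    ("model_a", "dim_2", 1.0),
])
```

Keeping the per-dimension means separate, rather than collapsing them into one number, is what lets a reader see which kind of social reasoning a given model fails at.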
🔎 Similar Papers
Wenshuo Peng
Tsinghua University
Gongxuan Wang
Shanghai AI Laboratory
Tianmeng Yang
Baidu ERNIE, Peking University
LLM, RL, Machine Learning, Data Mining
Chuanhao Li
Harbin Institute of Technology
Xiaojie Xu
The Hong Kong University of Science and Technology
Hui He
Harbin Institute of Technology
Kaipeng Zhang
Shanghai AI Laboratory
LLM, Multimodal LLMs, AIGC