SocialDirector: Training-Free Social Interaction Control for Multi-Person Video Generation

📅 2026-05-11
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Existing video generation models struggle to control the roles, timing, and action targets of multiple people in social interactions, which often leads to actor-action mismatches and incoherent dynamics. This work proposes SocialDirector, a training-free interaction controller that specifies "who performs what action on whom and when" by modulating cross-attention maps. It combines two components: Social Actor Masking, which restricts each person's visual tokens to attend only to that person's textual description, and Directional Reweighting, which amplifies attention to directional tokens that specify interaction targets. For evaluation, the authors annotate existing datasets with interaction descriptions and build a fully automated pipeline powered by open-source VLMs. Experiments on different video generation models show that SocialDirector significantly improves interaction fidelity and approaches the upper bound set by real videos.
📝 Abstract
Video generation has advanced rapidly, producing photorealistic videos from text or image prompts. Meanwhile, film production and social robotics increasingly demand multi-person videos with rich social interactions, including conversations, gestures, and coordinated actions. However, existing models offer no explicit control over interactions, such as who performs which action, when it occurs, and toward whom it is directed. This often results in the wrong person performing unintended actions (actor-action mismatch), disordered social dynamics, and wrong action targets. To address these challenges, we present SocialDirector, a training-free interaction controller that enhances the generation model by modulating cross-attention maps. SocialDirector contains two modules: Social Actor Masking and Directional Reweighting. Social Actor Masking constrains each person's visual tokens to attend only to their own textual descriptions via a spatiotemporal mask, avoiding actor-action mismatch and disordered social dynamics. Directional Reweighting amplifies attention to directional words (e.g., "leftward", "right"), steering each action toward its intended target. To evaluate generated social interactions, we annotate existing datasets with interaction descriptions and build a fully automated evaluation pipeline powered by open-source VLMs. Experiments on different video generation models show that SocialDirector significantly improves interaction fidelity and approaches the upper bound set by real videos.
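The two modules described in the abstract can be illustrated with a small NumPy sketch of a single cross-attention layer. This is not the authors' implementation: the function names, the per-token person labels (`token_person`, `word_person`), the use of `-1` for background tokens and shared words, and the additive logit `boost` are all assumptions made for illustration. The visual tokens are assumed to be flattened over frames and spatial positions, so a per-token label can change from frame to frame, which is one way a mask could be spatiotemporal.

```python
import numpy as np

def social_actor_mask(token_person, word_person):
    """Boolean mask [num_visual_tokens, num_words]; True means attention is allowed.

    token_person[i]: id of the person whose region contains visual token i,
                     or -1 for background (assumed convention).
    word_person[j]:  id of the person whose description contains word j,
                     or -1 for words shared by the whole prompt (assumed convention).
    """
    token_person = np.asarray(token_person)[:, None]
    word_person = np.asarray(word_person)[None, :]
    # A person's tokens see only that person's words plus shared words;
    # background tokens are left unrestricted (an assumption of this sketch).
    return (token_person == -1) | (word_person == -1) | (token_person == word_person)

def modulated_cross_attention(q, k, v, mask, directional_idx, boost=1.0):
    """Softmax cross-attention with actor masking and a directional-word boost."""
    if not mask.any(axis=-1).all():
        raise ValueError("every visual token needs at least one allowed word")
    logits = q @ k.T / np.sqrt(q.shape[-1])
    # Adding `boost` to a logit scales that word's unnormalized weight by exp(boost).
    logits[:, directional_idx] += boost
    # Masked pairs get -inf, so their softmax weight is exactly zero.
    logits = np.where(mask, logits, -np.inf)
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Toy setup: two people (ids 0 and 1), one background token, one shared word.
rng = np.random.default_rng(0)
token_person = [0, 0, 1, 1, -1]
word_person = [0, 0, 1, 1, -1]   # words 1 and 3 play the role of directional words
mask = social_actor_mask(token_person, word_person)
q = rng.normal(size=(5, 8))
k = rng.normal(size=(5, 8))
v = rng.normal(size=(5, 8))
out, weights = modulated_cross_attention(q, k, v, mask, directional_idx=[1, 3], boost=2.0)
```

In this sketch the mask is applied after the boost, so a directional word in one person's description never leaks into another person's visual tokens. A real video model would apply the same idea per attention head and per layer, and the paper may well reweight attention differently from the additive logit boost assumed here.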
Problem

Research questions and friction points this paper is trying to address.

multi-person video generation
social interaction control
actor-action mismatch
directional actions
interaction fidelity
Innovation

Methods, ideas, or system contributions that make the work stand out.

training-free control
cross-attention modulation
social interaction generation
multi-person video synthesis
directional attention reweighting
🔎 Similar Papers