🤖 AI Summary
Problem: Existing social behavior modeling approaches are restricted to dyadic interactions and lack generalizability to scenarios with arbitrary numbers of participants.
Method: This paper introduces the first unified motion–language modeling framework for multi-person social interaction. It (1) designs a social motion tokenization mechanism to discretize pose sequences of variable participant counts; (2) constructs SocialX—the first large-scale multi-person interaction dataset with fine-grained textual annotations—and establishes corresponding benchmarks; and (3) proposes a motion–language joint pretraining paradigm that integrates cross-modal alignment with LLM-driven action understanding and generation for holistic multimodal reasoning.
Results: Extensive experiments demonstrate state-of-the-art performance across multi-person behavior generation, social reasoning, and cross-modal retrieval tasks. The framework significantly improves scalability and semantic consistency in modeling complex, real-world social scenes.
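The summary above mentions a tokenizer that discretizes pose sequences for a variable number of participants, but gives no implementation details. As a purely illustrative sketch, one common way to achieve participant-count independence is to quantize each person's per-frame pose against a single shared codebook (VQ-style), so the same tokenizer applies to any group size. The function name, codebook size, and pose dimension below are assumptions for illustration, not the paper's actual design:

```python
import numpy as np

def tokenize_social_motion(poses, codebook):
    """Illustrative sketch (not the paper's method): quantize each
    person's pose frames against a shared codebook.

    poses:    (num_persons, num_frames, pose_dim) -- num_persons may vary
    codebook: (codebook_size, pose_dim)
    Returns integer token ids of shape (num_persons, num_frames).
    """
    num_persons, num_frames, pose_dim = poses.shape
    flat = poses.reshape(-1, pose_dim)
    # squared Euclidean distance from every frame to every codebook entry
    dists = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    # nearest codebook entry becomes the discrete "motion word"
    return dists.argmin(axis=-1).reshape(num_persons, num_frames)

rng = np.random.default_rng(0)
codebook = rng.normal(size=(64, 12))   # 64 hypothetical motion tokens
duo   = tokenize_social_motion(rng.normal(size=(2, 30, 12)), codebook)
crowd = tokenize_social_motion(rng.normal(size=(5, 30, 12)), codebook)
print(duo.shape, crowd.shape)          # same tokenizer handles any group size
```

Because the codebook is shared across individuals, the resulting token grid scales with the number of participants while keeping a fixed discrete vocabulary, which is what makes alignment with a language model's token space feasible.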
📝 Abstract
Human interactions in everyday life are inherently social, involving engagements with diverse individuals across various contexts. Modeling these social interactions is fundamental to a wide range of real-world applications. In this paper, we introduce SocialGen, the first unified motion-language model capable of modeling interaction behaviors among varying numbers of individuals, to address this crucial yet challenging problem. Unlike prior methods that are limited to two-person interactions, we propose a novel social motion representation that supports tokenizing the motions of an arbitrary number of individuals and aligning them with the language space. This alignment enables the model to leverage rich, pretrained linguistic knowledge to better understand and reason about human social behaviors. To tackle the challenges of data scarcity, we curate a comprehensive multi-human interaction dataset, SocialX, enriched with textual annotations. Leveraging this dataset, we establish the first comprehensive benchmark for multi-human interaction tasks. Our method achieves state-of-the-art performance across motion-language tasks, setting a new standard for multi-human interaction modeling.