SocialGen: Modeling Multi-Human Social Interaction with Language Models

📅 2025-03-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing social behavior modeling approaches are restricted to dyadic interactions and lack generalizability to scenarios with arbitrary numbers of participants. Method: This paper introduces the first unified motion–language modeling framework for multi-person social interaction. It (1) designs a social motion tokenization mechanism to discretize pose sequences of variable participant counts; (2) constructs SocialX—the first large-scale multi-person interaction dataset with fine-grained textual annotations—and establishes corresponding benchmarks; and (3) proposes a motion–language joint pretraining paradigm that integrates cross-modal alignment with LLM-driven action understanding and generation for holistic multimodal reasoning. Results: Extensive experiments demonstrate state-of-the-art performance across multi-person behavior generation, social reasoning, and cross-modal retrieval tasks. The framework significantly improves scalability and semantic consistency in modeling complex, real-world social scenes.
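The tokenization idea described above (discretizing pose sequences for an arbitrary number of participants so they can enter a language model's vocabulary) can be sketched roughly as follows. This is a minimal illustration, not the paper's actual mechanism: the codebook, pose dimensionality, separator-token scheme, and all function names are assumptions.

```python
import numpy as np

# Hypothetical sketch of variable-participant motion tokenization.
# All constants and design choices here are assumptions for illustration.
CODEBOOK_SIZE = 512
POSE_DIM = 72              # e.g. a flattened per-frame pose vector
SEP_TOKEN = CODEBOOK_SIZE  # separator token between participants

rng = np.random.default_rng(0)
codebook = rng.normal(size=(CODEBOOK_SIZE, POSE_DIM))  # stand-in for a learned codebook

def quantize_person(poses: np.ndarray) -> list[int]:
    """Map each frame's pose vector to the index of its nearest codebook entry."""
    # poses: (T, POSE_DIM) for one participant
    dists = ((poses[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1).tolist()

def tokenize_scene(people: list[np.ndarray]) -> list[int]:
    """Concatenate per-person token streams with separators, so one flat
    sequence encodes a scene with any number of participants."""
    tokens: list[int] = []
    for poses in people:
        tokens.extend(quantize_person(poses))
        tokens.append(SEP_TOKEN)
    return tokens

# Works unchanged for 2, 3, or N people with different sequence lengths.
scene = [rng.normal(size=(10, POSE_DIM)),
         rng.normal(size=(8, POSE_DIM)),
         rng.normal(size=(12, POSE_DIM))]
toks = tokenize_scene(scene)
print(len(toks))  # 10 + 8 + 12 frames + 3 separators = 33
```

Because the output is a single integer sequence, it can be interleaved with text tokens for the kind of joint motion-language pretraining the summary describes; the separator token is one simple way to make participant count explicit to the model.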

📝 Abstract
Human interactions in everyday life are inherently social, involving engagements with diverse individuals across various contexts. Modeling these social interactions is fundamental to a wide range of real-world applications. In this paper, we introduce SocialGen, the first unified motion-language model capable of modeling interaction behaviors among varying numbers of individuals, to address this crucial yet challenging problem. Unlike prior methods that are limited to two-person interactions, we propose a novel social motion representation that supports tokenizing the motions of an arbitrary number of individuals and aligning them with the language space. This alignment enables the model to leverage rich, pretrained linguistic knowledge to better understand and reason about human social behaviors. To tackle the challenges of data scarcity, we curate a comprehensive multi-human interaction dataset, SocialX, enriched with textual annotations. Leveraging this dataset, we establish the first comprehensive benchmark for multi-human interaction tasks. Our method achieves state-of-the-art performance across motion-language tasks, setting a new standard for multi-human interaction modeling.
Problem

Research questions and friction points this paper is trying to address.

Modeling multi-human social interactions with varying participant numbers
Aligning motion representations with language space for better reasoning
Addressing data scarcity in multi-human interaction modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified motion-language model for multi-human interactions
Novel social motion representation for any group size
Comprehensive dataset with textual annotations for training