Gen-C: Populating Virtual Worlds with Generative Crowds

📅 2025-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing virtual crowd simulation methods primarily focus on low-level motion behaviors (e.g., obstacle avoidance, flocking), failing to model high-level social interactions among agents or between humans and the environment. Method: We propose the first end-to-end framework integrating large language models (LLMs) with graph generation models to automatically synthesize dynamic, interactive virtual crowds from natural language descriptions. Our approach introduces a novel language-guided temporal graph generation paradigm that decouples topological relational modeling from node-level action feature learning, incorporating variational graph autoencoders (VGAEs), conditional prior networks, and multimodal behavioral representation learning. Contribution/Results: The framework generalizes to long-horizon, high-interaction-density social behaviors without requiring annotated crowd videos. Evaluated on campus and railway station scenarios, it demonstrates strong text-driven zero-shot synthesis capability, significantly enhancing immersion in virtual worlds and improving content generation efficiency.

📝 Abstract
Over the past two decades, researchers have made significant advancements in simulating human crowds, yet these efforts largely focus on low-level tasks like collision avoidance and a narrow range of behaviors such as path following and flocking. However, creating compelling crowd scenes demands more than just functional movement; it requires capturing high-level interactions between agents, their environment, and each other over time. To address this issue, we introduce Gen-C, a generative model to automate the task of authoring high-level crowd behaviors. Gen-C bypasses the labor-intensive and challenging task of collecting and annotating real crowd video data by leveraging a large language model (LLM) to generate a limited set of crowd scenarios, which are subsequently expanded and generalized through simulations to construct time-expanded graphs that model the actions and interactions of virtual agents. Our method employs two Variational Graph Auto-Encoders guided by a condition prior network: one dedicated to learning a latent space for graph structures (agent interactions) and the other for node features (agent actions and navigation). This setup enables the flexible generation of dynamic crowd interactions. The trained model can be conditioned on natural language, empowering users to synthesize novel crowd behaviors from text descriptions. We demonstrate the effectiveness of our approach in two scenarios, a University Campus and a Train Station, showcasing its potential for populating diverse virtual environments with agents exhibiting varied and dynamic behaviors that reflect complex interactions and high-level decision-making patterns.
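The time-expanded graphs described in the abstract (nodes for agents at each timestep, edges for interactions) can be pictured with a minimal data-structure sketch. This is an illustration only, not the paper's implementation; the `AgentNode` and `TimeExpandedGraph` names and the action labels are assumptions for the example.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class AgentNode:
    """One agent at one timestep; `action` is a label such as 'talk' or 'walk'."""
    agent_id: int
    timestep: int
    action: str

@dataclass
class TimeExpandedGraph:
    """Toy sketch of a time-expanded interaction graph: nodes are
    (agent, timestep) pairs, undirected edges mark interactions."""
    nodes: set = field(default_factory=set)
    edges: set = field(default_factory=set)

    def add_interaction(self, a: AgentNode, b: AgentNode) -> None:
        assert a.timestep == b.timestep, "interactions happen within a timestep"
        self.nodes.update({a, b})
        self.edges.add(frozenset({a, b}))

    def interactions_at(self, t: int) -> list:
        return [e for e in self.edges if next(iter(e)).timestep == t]

# Two agents chatting at t=0; no interactions at t=1.
g = TimeExpandedGraph()
g.add_interaction(AgentNode(0, 0, "talk"), AgentNode(1, 0, "talk"))
print(len(g.interactions_at(0)))  # 1
```

Stacking such per-timestep interaction edges over a horizon is what lets a graph generative model reason about who interacts with whom, and when, rather than only where agents walk.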
Problem

Research questions and friction points this paper is trying to address.

Automating high-level crowd behavior generation in virtual worlds
Overcoming labor-intensive real crowd data collection via LLM-based simulation
Enabling dynamic crowd interactions through variational graph auto-encoders
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages LLM to generate crowd scenarios
Uses Variational Graph Auto-Encoders for interactions
Conditions model on natural language inputs
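The variational machinery behind the second bullet can be illustrated with the two standard VGAE building blocks: the reparameterization trick for sampling latents, and an inner-product decoder that turns latent node embeddings into edge (interaction) probabilities. This is a generic pure-Python sketch of those textbook components, not the paper's model; the latent values below are made up for the example.

```python
import math
import random

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps (the reparameterization trick), so a
    variational autoencoder can backpropagate through the sampling step."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]

def decode_edge_prob(z_i, z_j):
    """Inner-product decoder: sigmoid(z_i . z_j) is the standard VGAE
    estimate of the probability that an edge connects nodes i and j."""
    dot = sum(a * b for a, b in zip(z_i, z_j))
    return 1.0 / (1.0 + math.exp(-dot))

rng = random.Random(0)
# Hypothetical 4-dim latents for two agent nodes (illustrative values only).
z_a = reparameterize([0.5, -0.2, 0.1, 0.0], [-2.0] * 4, rng)
z_b = reparameterize([0.4, -0.1, 0.2, 0.1], [-2.0] * 4, rng)
p = decode_edge_prob(z_a, z_b)
assert 0.0 < p < 1.0  # a valid probability for an interaction edge
```

In Gen-C's setup, one such latent space governs the graph structure (which agents interact) while a second governs node features (what each agent does), with a condition prior network tying both to the text description.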