CrowdVLA: Embodied Vision-Language-Action Agents for Context-Aware Crowd Simulation

📅 2026-04-07
🤖 AI Summary
This work proposes the first vision-language-action (VLA) integrated, context-aware framework for crowd simulation, addressing the limitations of traditional methods that focus solely on geometric obstacle avoidance while neglecting spatial semantics, social norms, and behavioral consequences. The approach models pedestrians as embodied agents capable of perceiving visual scenes and interpreting language instructions to understand contextual semantics. Decision-making is guided by a counterfactual exploration-based question-answering mechanism that enables consequence-aware behavior. By combining agent-centric visual supervision, a LoRA-fine-tuned pretrained vision-language model, and a hybrid symbolic-continuous skill action space, the framework achieves a paradigm shift from motion synthesis to intention-driven simulation within semantically reconstructed environments. The resulting crowd exhibits not only realistic motion but also contextually plausible behaviors with explicit intent, significantly enhancing both realism and semantic richness.
📝 Abstract
Crowds do not merely move; they decide. Human navigation is inherently contextual: people interpret the meaning of space, social norms, and potential consequences before acting. Sidewalks invite walking, crosswalks invite crossing, and deviations are weighed against urgency and safety. Yet most crowd simulation methods reduce navigation to geometry and collision avoidance, producing motion that is plausible but rarely intentional. We introduce CrowdVLA, a new formulation of crowd simulation that models each pedestrian as a Vision-Language-Action (VLA) agent. Instead of replaying recorded trajectories, CrowdVLA enables agents to interpret scene semantics and social norms from visual observations and language instructions, and to select actions through consequence-aware reasoning. CrowdVLA addresses three key challenges (limited agent-centric supervision in crowd datasets, unstable per-frame control, and success-biased datasets) through: (i) agent-centric visual supervision via semantically reconstructed environments and Low-Rank Adaptation (LoRA) fine-tuning of a pretrained vision-language model, (ii) a motion skill action space that bridges symbolic decision making and continuous locomotion, and (iii) exploration-based question answering that exposes agents to counterfactual actions and their outcomes through simulation rollouts. Our results shift crowd simulation from motion-centric synthesis toward perception-driven, consequence-aware decision making, enabling crowds that move not just realistically, but meaningfully.
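The abstract's components (ii) and (iii) can be illustrated with a minimal sketch: a discrete motion-skill action space, and consequence-aware selection that scores each candidate skill via a counterfactual rollout. All names, skills, and the toy outcome table below are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical symbolic motion skills bridging discrete decision making
# and continuous locomotion (skill names are illustrative).
SKILLS = ["walk_sidewalk", "cross_at_crosswalk", "wait", "yield_to_pedestrian"]

def rollout_score(skill, context):
    """Stand-in for a simulation rollout that returns a consequence score.
    In CrowdVLA, counterfactual rollouts expose each action's outcome;
    here a toy lookup table keyed by scene context plays that role."""
    table = {
        ("crosswalk_green", "cross_at_crosswalk"): 1.0,
        ("crosswalk_green", "wait"): 0.3,
        ("crosswalk_red", "cross_at_crosswalk"): -1.0,
        ("crosswalk_red", "wait"): 1.0,
    }
    return table.get((context, skill), 0.0)

def select_skill(context, skills=SKILLS):
    """Consequence-aware selection: evaluate every candidate skill
    counterfactually and pick the one with the best simulated outcome."""
    return max(skills, key=lambda s: rollout_score(s, context))

print(select_skill("crosswalk_red"))    # -> wait
print(select_skill("crosswalk_green"))  # -> cross_at_crosswalk
```

The point of the sketch is the control flow, not the scoring: the agent never commits to an action until every alternative's simulated consequence has been compared.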
Problem

Research questions and friction points this paper is trying to address.

crowd simulation
context-aware navigation
intentional behavior
social norms
embodied agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Action
crowd simulation
consequence-aware reasoning
LoRA fine-tuning
embodied agents
Juyeong Hwang
IIIXR Lab, Korea University, South Korea
Seong-Eun Hong
IIIXR Lab, Korea University, South Korea
Jinhyun Kim
IIIXR Lab, Korea University, South Korea
JaeYoung Seon
IIIXR Lab, Kyung Hee University, South Korea
Giljoo Nam
Meta
Computer Vision, Computer Graphics, Machine Learning
Hanyoung Jang
NC AI, South Korea
HyeongYeop Kang
Assistant Professor, Korea University
Neural Computer Graphics, Extended Reality, Artificial Intelligence, Human-Computer Interaction