AI Summary
Existing multi-agent video generation methods produce visually appealing output, but their semantics are often unreliable and they lack an executable structured representation. This work proposes an agentic system in which a large language model (LLM) constructs a Graph of Events in Space and Time (GEST) that is then rendered deterministically in a 3D game engine. A separation-of-concerns architecture decouples narrative planning from physical execution: the LLM handles high-level storytelling, while a programmatic state backend enforces simulator constraints through validated tool calls, so that outputs are semantically accurate and physically plausible. The system uses a hierarchical two-agent design (a Director and a Scene Builder), a round-based state machine, and Relation Subagents that populate the logical and semantic edges of the graph. In automatic evaluation by a 3-model LLM jury, the agentic narratives are preferred over procedural baselines in 79% of text and 74% of video comparisons. Human assessments further show a physical validity rate of 58%, substantially higher than VEO 3.1 (25%) and WAN 2.2 (20%), and a semantic alignment score of 3.75/5 versus 2.33 and 1.50 for the respective baselines.
Abstract
Existing multi-agent video generation systems use LLM agents to orchestrate neural video generators, producing visually impressive but semantically unreliable outputs with no ground truth annotations. We present an agentic system that inverts this paradigm: instead of generating pixels, the LLM constructs a formal Graph of Events in Space and Time (GEST) -- a structured specification of actors, actions, objects, and temporal constraints -- which is then executed deterministically in a 3D game engine. A staged LLM refinement pipeline fails entirely at this task (0 of 50 attempts produce an executable specification), motivating a fundamentally different architecture based on a separation of concerns: the LLM handles narrative planning through natural language reasoning, while a programmatic state backend enforces all simulator constraints through validated tool calls, guaranteeing that every generated specification is executable by construction. The system uses a hierarchical two-agent architecture -- a Director that plans the story and a Scene Builder that constructs individual scenes through a round-based state machine -- with dedicated Relation Subagents that populate the logical and semantic edge types of the GEST formalism that procedural generation leaves empty, making this the first approach to exercise the full expressive capacity of the representation. We evaluate in two stages: autonomous generation against procedural baselines via a 3-model LLM jury, where agentic narratives win 79% of text and 74% of video comparisons; and seeded generation where the same text is given to our system, VEO 3.1, and WAN 2.2, with human annotations showing engine-generated videos substantially outperform neural generators on physical validity (58% vs 25% and 20%) and semantic alignment (3.75/5 vs 2.33 and 1.50).
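The separation of concerns described above can be illustrated with a small sketch. The abstract does not specify the tool interface, the constraint set, or the graph schema, so everything below is an assumption made for illustration: the class `SceneState`, the tool names `add_actor`, `add_object`, `add_event`, and `add_before`, the action catalogue, and the acyclicity check on temporal edges are all hypothetical. The point is only the pattern: the LLM never writes the specification directly, it issues tool calls, and a programmatic backend rejects any call that would leave the state invalid.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


class ValidationError(Exception):
    """Raised when a tool call would violate a (hypothetical) simulator constraint."""


@dataclass(frozen=True)
class Event:
    actor: str
    action: str
    target: Optional[str] = None


@dataclass
class SceneState:
    """Toy state backend: every mutation is validated before it is applied."""

    # Hypothetical action catalogue: action name -> whether it needs a target object.
    action_needs_target: Dict[str, bool]
    actors: Set[str] = field(default_factory=set)
    objects: Set[str] = field(default_factory=set)
    events: List[Event] = field(default_factory=list)
    # Temporal edges stored as (earlier_event_id, later_event_id).
    before: Set[Tuple[int, int]] = field(default_factory=set)

    def add_actor(self, name: str) -> None:
        if name in self.actors:
            raise ValidationError(f"actor {name!r} already exists")
        self.actors.add(name)

    def add_object(self, name: str) -> None:
        if name in self.objects:
            raise ValidationError(f"object {name!r} already exists")
        self.objects.add(name)

    def add_event(self, actor: str, action: str, target: Optional[str] = None) -> int:
        if actor not in self.actors:
            raise ValidationError(f"unknown actor {actor!r}")
        if action not in self.action_needs_target:
            raise ValidationError(f"action {action!r} is not supported")
        needs_target = self.action_needs_target[action]
        if needs_target and target not in self.objects:
            raise ValidationError(f"action {action!r} needs a known target object")
        if not needs_target and target is not None:
            raise ValidationError(f"action {action!r} takes no target")
        self.events.append(Event(actor, action, target))
        return len(self.events) - 1  # event id

    def add_before(self, earlier: int, later: int) -> None:
        n = len(self.events)
        if not (0 <= earlier < n and 0 <= later < n):
            raise ValidationError("temporal edge refers to an unknown event")
        # Adding earlier -> later creates a cycle if later already reaches earlier.
        if earlier == later or self._reaches(later, earlier):
            raise ValidationError("temporal edge would create a cycle")
        self.before.add((earlier, later))

    def _reaches(self, start: int, goal: int) -> bool:
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == goal:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(b for a, b in self.before if a == node)
        return False


def apply_tool_call(state: SceneState, tool: str, args: dict) -> dict:
    """Dispatch one tool call; rejected calls leave the state untouched."""
    handlers = {
        "add_actor": state.add_actor,
        "add_object": state.add_object,
        "add_event": state.add_event,
        "add_before": state.add_before,
    }
    if tool not in handlers:
        return {"ok": False, "error": f"unknown tool {tool!r}"}
    try:
        result = handlers[tool](**args)
    except (ValidationError, TypeError) as exc:
        # The error text is what an agent would read before retrying.
        return {"ok": False, "error": str(exc)}
    return {"ok": True, "result": result}
```

Because each method validates before it mutates, any state reachable through `apply_tool_call` satisfies the toy constraints. This is the sense in which a specification can be executable by construction rather than checked after the fact, although the real system's constraints are presumably far richer than this sketch.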