Agentic Video Generation: From Text to Executable Event Graphs via Tool-Constrained LLM Planning

📅 2026-04-11
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Existing multi-agent video generation methods produce visually appealing output but suffer from unreliable semantics and lack an executable structured representation. This work instead has a large language model (LLM) construct a Graph of Events in Space and Time (GEST), a formal specification of actors, actions, objects, and temporal constraints, and decouples narrative planning from physical execution through a separation of concerns: the LLM handles high-level storytelling, while a programmatic state backend enforces simulator constraints through validated tool calls, so that every generated specification is executable by construction. The system uses a hierarchical two-agent design (a Director and a Scene Builder), a round-based state machine, and Relation Subagents that populate the logical and semantic edges, with deterministic rendering in a 3D game engine. In autonomous generation, judged by a 3-model LLM jury against procedural baselines, agentic narratives win 79% of text comparisons and 74% of video comparisons. In seeded generation, human annotations give engine-generated videos a physical validity rate of 58%, versus 25% for VEO 3.1 and 20% for WAN 2.2, and a semantic alignment score of 3.75/5, versus 2.33 and 1.50 for the same two baselines.

📝 Abstract
Existing multi-agent video generation systems use LLM agents to orchestrate neural video generators, producing visually impressive but semantically unreliable outputs with no ground truth annotations. We present an agentic system that inverts this paradigm: instead of generating pixels, the LLM constructs a formal Graph of Events in Space and Time (GEST) -- a structured specification of actors, actions, objects, and temporal constraints -- which is then executed deterministically in a 3D game engine. A staged LLM refinement pipeline fails entirely at this task (0 of 50 attempts produce an executable specification), motivating a fundamentally different architecture based on a separation of concerns: the LLM handles narrative planning through natural language reasoning, while a programmatic state backend enforces all simulator constraints through validated tool calls, guaranteeing that every generated specification is executable by construction. The system uses a hierarchical two-agent architecture -- a Director that plans the story and a Scene Builder that constructs individual scenes through a round-based state machine -- with dedicated Relation Subagents that populate the logical and semantic edge types of the GEST formalism that procedural generation leaves empty, making this the first approach to exercise the full expressive capacity of the representation. We evaluate in two stages: autonomous generation against procedural baselines via a 3-model LLM jury, where agentic narratives win 79% of text and 74% of video comparisons; and seeded generation where the same text is given to our system, VEO 3.1, and WAN 2.2, with human annotations showing engine-generated videos substantially outperform neural generators on physical validity (58% vs 25% and 20%) and semantic alignment (3.75/5 vs 2.33 and 1.50).
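To make the separation of concerns in the abstract concrete, here is a minimal Python sketch of a state backend that accepts graph mutations only through validated tool calls. It is an illustration under assumptions, not the paper's implementation: the names `StateBackend`, `add_event`, `add_before`, the action table, and the result format are all hypothetical, and the only constraints checked are known actors, supported actions, valid targets, and acyclic temporal ordering.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple


@dataclass
class Event:
    event_id: str
    actor: str
    action: str
    target: Optional[str] = None


@dataclass
class EventGraph:
    events: Dict[str, Event] = field(default_factory=dict)
    # Temporal edges stored as (earlier_event_id, later_event_id) pairs.
    before: Set[Tuple[str, str]] = field(default_factory=set)


class StateBackend:
    """Hypothetical backend: the planner can only change the graph via these calls."""

    def __init__(self, actors, objects, actions):
        self.actors = set(actors)
        self.objects = set(objects)
        # Assumed action table: action name -> whether it requires a target object.
        self.actions = dict(actions)
        self.graph = EventGraph()

    def add_event(self, event_id, actor, action, target=None):
        """Validate first, mutate second; an invalid call leaves the graph unchanged."""
        if event_id in self.graph.events:
            return {"ok": False, "error": f"duplicate event id: {event_id}"}
        if actor not in self.actors:
            return {"ok": False, "error": f"unknown actor: {actor}"}
        if action not in self.actions:
            return {"ok": False, "error": f"unsupported action: {action}"}
        needs_target = self.actions[action]
        if needs_target and target not in self.objects:
            return {"ok": False, "error": f"invalid target for {action}: {target}"}
        if not needs_target and target is not None:
            return {"ok": False, "error": f"{action} takes no target"}
        self.graph.events[event_id] = Event(event_id, actor, action, target)
        return {"ok": True}

    def add_before(self, earlier, later):
        """Add a temporal edge unless it references unknown events or closes a cycle."""
        for event_id in (earlier, later):
            if event_id not in self.graph.events:
                return {"ok": False, "error": f"unknown event: {event_id}"}
        if self._reaches(later, earlier):
            return {"ok": False, "error": "edge would make the ordering cyclic"}
        self.graph.before.add((earlier, later))
        return {"ok": True}

    def _reaches(self, start, goal):
        # Depth-first search along existing "before" edges.
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == goal:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(b for a, b in self.graph.before if a == node)
        return False


# Example calls a planning agent might issue (all values are made up).
demo = StateBackend(["alice"], ["chair"], {"wave": False, "sit_on": True})
demo.add_event("e1", "alice", "wave")              # accepted
demo.add_event("e2", "bob", "wave")                # rejected: unknown actor
demo.add_event("e2", "alice", "sit_on", "chair")   # accepted
demo.add_before("e1", "e2")                        # accepted
demo.add_before("e2", "e1")                        # rejected: would create a cycle
```

Because a rejected call returns an error message instead of mutating state, the graph held by the backend satisfies the checked constraints at every step. This is the sense in which a specification can be valid by construction; the constraints a real simulator imposes would be far richer than the four checked here.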
Problem

Research questions and friction points this paper is trying to address.

video generation
semantic reliability
executable specification
physical validity
event graph
Innovation

Methods, ideas, or system contributions that make the work stand out.

Agentic Video Generation
Graph of Events in Space and Time (GEST)
Tool-Constrained LLM Planning
Executable Specification
3D Simulation-based Generation