🤖 AI Summary
Language models have not been systematically evaluated on generating procedural video code that is both spatially accurate and temporally coherent. This work proposes PRISM, a large-scale bilingual (English and Chinese) instruction-to-code benchmark, and introduces a four-tier funnel-style evaluation framework that assesses code executability, spatial layout correctness, Prompt-Aware Dynamic Visual Complexity (PADVC), and Temporal Density (TD). The evaluation exposes an "Execution-Spatial Gap": executable code does not necessarily yield spatially coherent visual output. Across seven mainstream large language models evaluated on real-world knowledge visualization scenarios, the average drop from execution success rate to spatial pass rate is approximately 41%, which points to weaknesses in spatial reasoning over animated output. The contribution is a systematic evaluation framework and benchmark for programmatic video generation.
📝 Abstract
Programmatic video generation through code offers geometric precision and temporal coherence beyond pixel-level diffusion models, yet rigorously evaluating whether language models can produce spatially correct animated outputs remains an open problem. We introduce PRISM, a large-scale benchmark of 10,372 human-calibrated instruction-code pairs (20 times larger than prior programmatic video generation benchmarks), grounded in real-world knowledge visualization scenarios across English and Chinese and spanning 437 subject categories. We further propose a funnel-style evaluation framework with four complementary metrics: Code-Level Reliability for executability, Spatial Reasoning for layout correctness over full animation sequences, and Prompt-Aware Dynamic Visual Complexity (PADVC) and Temporal Density (TD) for diagnosing dynamic expression and temporal activity. Systematic evaluation of seven mainstream LLMs reveals a striking Execution-Spatial Gap: the average drop from execution success rate to spatial pass rate is approximately 41%, showing that runnable code does not necessarily yield spatially coherent visual output. These findings show that programmatic video generation evaluation should go beyond executability. PRISM provides a principled benchmark for advancing spatially coherent code generation.
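To make the Execution-Spatial Gap concrete, the following is a minimal sketch of how the first two funnel tiers could be aggregated. It is not the paper's implementation: the `SampleResult` record, its field names, and the `funnel_rates` helper are hypothetical, and the sketch assumes both rates are computed over all benchmark samples and that the gap is their simple difference.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class SampleResult:
    """Hypothetical per-sample outcome for the first two funnel tiers."""
    executed: bool    # generated code ran and rendered without error
    spatial_ok: bool  # layout check passed over the full animation sequence


def funnel_rates(results: Iterable[SampleResult]) -> Tuple[float, float, float]:
    """Return (execution rate, spatial pass rate, gap) over all samples.

    Assumption: a sample can only pass the spatial tier if it executed,
    so spatial_ok is ignored for samples that failed to run.
    """
    results = list(results)
    if not results:
        raise ValueError("no results to aggregate")
    total = len(results)
    exec_rate = sum(r.executed for r in results) / total
    spatial_rate = sum(r.executed and r.spatial_ok for r in results) / total
    return exec_rate, spatial_rate, exec_rate - spatial_rate


if __name__ == "__main__":
    demo = [
        SampleResult(executed=True, spatial_ok=True),
        SampleResult(executed=True, spatial_ok=False),
        SampleResult(executed=True, spatial_ok=False),
        SampleResult(executed=False, spatial_ok=False),
    ]
    exec_rate, spatial_rate, gap = funnel_rates(demo)
    print(f"execution={exec_rate:.2f} spatial={spatial_rate:.2f} gap={gap:.2f}")
```

In this toy input, three of four samples run but only one also passes the layout check, so a model that looks strong on executability alone would still show a large gap. Whether the reported 41% is measured exactly this way is not stated in the abstract.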