PRISM: A Benchmark for Programmatic Spatial-Temporal Reasoning

📅 2026-05-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Language models have not been systematically evaluated on generating programmatic video code that is both spatially accurate and temporally coherent. This work proposes PRISM, a large-scale bilingual (English and Chinese) instruction-to-code benchmark, and introduces a four-metric, funnel-style evaluation framework covering code executability, spatial layout correctness, Prompt-Aware Dynamic Visual Complexity (PADVC), and Temporal Density (TD). The authors identify an "execution-spatial gap": executable code does not necessarily yield spatially coherent visual output. Evaluating seven mainstream large language models on real-world knowledge visualization scenarios shows an average drop of approximately 41% from execution success rate to spatial pass rate, pointing to substantial weaknesses in spatial-temporal reasoning. The work provides a systematic evaluation framework and benchmark for programmatic video generation.
📝 Abstract
Programmatic video generation through code offers geometric precision and temporal coherence beyond pixel-level diffusion models, yet rigorously evaluating whether language models can produce spatially correct animated outputs remains an open problem. We introduce PRISM, a large-scale benchmark of 10,372 human-calibrated instruction-code pairs (20 times larger than prior programmatic video generation benchmarks), grounded in real-world knowledge visualization scenarios across English and Chinese and spanning 437 subject categories. We further propose a funnel-style evaluation framework with four complementary metrics: Code-Level Reliability for executability, Spatial Reasoning for layout correctness over full animation sequences, and Prompt-Aware Dynamic Visual Complexity (PADVC) and Temporal Density (TD) for diagnosing dynamic expression and temporal activity. Systematic evaluation of seven mainstream LLMs reveals a striking Execution-Spatial Gap: the average drop from execution success rate to spatial pass rate is approximately 41%, showing that runnable code does not necessarily yield spatially coherent visual output. These findings show that programmatic video generation evaluation should go beyond executability. PRISM provides a principled benchmark for advancing spatially coherent code generation.
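The Execution-Spatial Gap described above can be illustrated with a minimal sketch. It assumes the gap is the absolute difference between the execution success rate and the spatial pass rate, both taken over all samples; the abstract does not state the exact definition. The `Result` record, the `funnel_rates` helper, and the toy data are hypothetical and are not taken from PRISM.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """Hypothetical per-sample outcome for one generated program."""
    executed: bool    # did the generated code run and render?
    spatial_ok: bool  # did the rendered animation pass the layout check?


def funnel_rates(results):
    """Return (execution rate, spatial pass rate, gap) over all samples.

    Assumption: a sample counts toward the spatial tier only if it
    executed first (funnel style), and the gap is the absolute
    difference between the two rates.
    """
    total = len(results)
    if total == 0:
        raise ValueError("no results to evaluate")
    executed = [r for r in results if r.executed]
    spatial = [r for r in executed if r.spatial_ok]
    exec_rate = len(executed) / total
    spatial_rate = len(spatial) / total
    return exec_rate, spatial_rate, exec_rate - spatial_rate


# Toy data (made up): 3 of 4 programs run, 1 of 4 is also spatially correct.
toy = [
    Result(True, True),
    Result(True, False),
    Result(True, False),
    Result(False, False),
]
exec_rate, spatial_rate, gap = funnel_rates(toy)
print(f"execution={exec_rate:.2f} spatial={spatial_rate:.2f} gap={gap:.2f}")
```

On this toy input, a model that looks strong on executability alone loses most of its score once layout correctness is required, which is the pattern the abstract reports at benchmark scale.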
Problem

Research questions and friction points this paper is trying to address.

programmatic video generation
spatial-temporal reasoning
language models
evaluation benchmark
spatial coherence
Innovation

Methods, ideas, or system contributions that make the work stand out.

programmatic video generation
spatial-temporal reasoning
execution-spatial gap
funnel-style evaluation
large-scale benchmark
Qiran Zhang
School of Artificial Intelligence, Shanghai Jiao Tong University
Yuheng Wang
School of Artificial Intelligence, Shanghai Jiao Tong University
Runde Yang
School of Artificial Intelligence, Shanghai Jiao Tong University
Lin Wu
School of Artificial Intelligence, Shanghai Jiao Tong University
Jingru Fan
School of Artificial Intelligence, Shanghai Jiao Tong University
Shu Yao
Beijing University of Posts and Telecommunications
Artificial Intelligence
Jie Zhang
Shanghai Jiao Tong University
High energy density physics
Tianle Zhou
School of Artificial Intelligence, Shanghai Jiao Tong University
Huatao Li
School of Artificial Intelligence, Shanghai Jiao Tong University
Ruijie Shi
School of Artificial Intelligence, Shanghai Jiao Tong University
Yihan Li
School of Artificial Intelligence, Shanghai Jiao Tong University
Chen Qian
Ph.D., Shanghai Jiao Tong University
Interpretable AI, intelligent fault diagnosis