SeqBench: Benchmarking Sequential Narrative Generation in Text-to-Video Models

📅 2025-10-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current text-to-video (T2V) models generate high-fidelity single frames but exhibit limited capability in multi-event logical progression, long-range temporal consistency, and narrative coherence, compounded by the absence of dedicated evaluation benchmarks. To address this, we propose SeqBench—the first comprehensive T2V benchmark explicitly designed to assess sequential narrative coherence—comprising 320 complex narrative prompts and 2,560 human-annotated videos. We introduce Dynamic Temporal Graphs (DTG), a novel automated metric that precisely models long-distance temporal dependencies and causal logic. Additionally, we establish a multidimensional narrative complexity framework and a fine-grained human annotation schema. Experiments demonstrate strong correlation between DTG scores and human judgments (Spearman’s ρ > 0.85) and systematically expose critical failures of state-of-the-art models in object state consistency, multi-object physical plausibility, and action sequencing logic.

📝 Abstract
Text-to-video (T2V) generation models have made significant progress in creating visually appealing videos. However, they struggle with generating coherent sequential narratives that require logical progression through multiple events. Existing T2V benchmarks primarily focus on visual quality metrics but fail to evaluate narrative coherence over extended sequences. To bridge this gap, we present SeqBench, a comprehensive benchmark for evaluating sequential narrative coherence in T2V generation. SeqBench includes a carefully designed dataset of 320 prompts spanning various narrative complexities, with 2,560 human-annotated videos generated from 8 state-of-the-art T2V models. Additionally, we design a Dynamic Temporal Graphs (DTG)-based automatic evaluation metric, which can efficiently capture long-range dependencies and temporal ordering while maintaining computational efficiency. Our DTG-based metric demonstrates a strong correlation with human annotations. Through systematic evaluation using SeqBench, we reveal critical limitations in current T2V models: failure to maintain consistent object states across multi-action sequences, physically implausible results in multi-object scenarios, and difficulties in preserving realistic timing and ordering relationships between sequential actions. SeqBench provides the first systematic framework for evaluating narrative coherence in T2V generation and offers concrete insights for improving sequential reasoning capabilities in future models. Please refer to https://videobench.github.io/SeqBench.github.io/ for more details.
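The abstract reports that the DTG-based metric correlates strongly with human annotations (the AI summary quotes Spearman's ρ > 0.85). As an illustration of how such metric-vs-human agreement is typically computed, here is a minimal, tie-aware Spearman's ρ in pure Python; the score lists in the usage comment are hypothetical, not data from the paper.

```python
def average_ranks(values):
    """Assign 1-based ranks, averaging ranks across tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions in the tie run
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical example: automated metric scores vs. human ratings per video
metric_scores = [0.9, 0.4, 0.7, 0.2, 0.6]
human_ratings = [5, 2, 4, 1, 3]
rho = spearman_rho(metric_scores, human_ratings)  # identical ordering → 1.0
```

A rank correlation (rather than Pearson) is the natural choice here because human coherence ratings are ordinal and the metric's score scale need not be linear in perceived quality.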
Problem

Research questions and friction points this paper is trying to address.

Evaluating sequential narrative coherence in text-to-video generation models
Addressing limitations in maintaining logical event progression across videos
Developing automated metrics for long-range temporal dependencies in narratives
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Temporal Graphs metric for narrative evaluation
Human-annotated dataset with 320 narrative prompts
Systematic framework assessing sequential reasoning capabilities
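The paper does not spell out the DTG formulation on this page, but its core idea is checking a generated video's event sequence against the temporal ordering and dependencies implied by the prompt. A toy sketch of that ordering-consistency check, assuming pairwise "A before B" constraints and hypothetical event names:

```python
def ordering_violations(constraints, detected_order):
    """Count pairwise temporal constraints (before, after) that a detected
    event sequence violates; a missing event also counts as a violation.

    Toy illustration of a temporal-graph consistency check, not the
    paper's actual DTG metric."""
    pos = {event: i for i, event in enumerate(detected_order)}
    violations = 0
    for before, after in constraints:
        if before not in pos or after not in pos:
            violations += 1  # required event never appears in the video
        elif pos[before] >= pos[after]:
            violations += 1  # events appear, but in the wrong order
    return violations

# Hypothetical narrative: pick up a cup, pour water, then drink
constraints = [("pick_up_cup", "pour_water"), ("pour_water", "drink")]
detected = ["pick_up_cup", "drink", "pour_water"]  # drinking before pouring
n_bad = ordering_violations(constraints, detected)  # → 1
```

A full metric would also score long-range state consistency (e.g. the cup stays full until it is drunk from), which is where the "dynamic" part of a dynamic temporal graph comes in.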
Zhengxu Tang
Department of Electrical and Computer Engineering, University of Michigan, Ann Arbor, United States
Z
Zizheng Wang
Department of Mechanical and Industrial Engineering, Northeastern University, Boston, United States
Luning Wang
University of Michigan, Ann Arbor
Artificial Intelligence
Zitao Shuai
UCLA; University of Michigan
Chenhao Zhang
Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, United States
Siyu Qian
School of Engineering and Applied Sciences, Harvard University, Cambridge, United States
Yirui Wu
School of Electronic and Information Engineering, Beijing Jiaotong University, Beijing, China
Bohao Wang
College of Information Science & Electronic Engineering, Zhejiang University
Wireless AI · Communication · 6G · Digital Twin · Ray Tracing
Haosong Rao
Georgen Institute for Data Science, University of Rochester, Rochester, United States
Zhenyu Yang
School of Earth Sciences, Zhejiang University, Hangzhou, China
Chenwei Wu
Georgen Institute for Data Science, University of Rochester, Rochester, United States