T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation

📅 2024-07-19
🏛️ arXiv.org
📈 Citations: 11
Influential: 1
🤖 AI Summary
Current text-to-video (T2V) generation lacks systematic evaluation of compositional capabilities—including attribute binding, spatial relations, motion modeling, object interaction, and numerical understanding. Method: We introduce T2V-CompBench, the first compositional benchmark for T2V, comprising 1,400 structured prompts across seven dimensions. We formally define and quantify compositional reasoning in T2V and propose a multi-granularity evaluation framework integrating multimodal large language model (MLLM) reasoning, object detection, and tracking. Results: Human evaluation confirms high correlation with our automated metrics. Comprehensive assessment of leading T2V models reveals pervasive deficiencies in compositional reasoning. We publicly release the benchmark data, evaluation code, and analysis framework to standardize and enhance reproducibility in T2V compositional research.

📝 Abstract
Text-to-video (T2V) generative models have advanced significantly, yet their ability to compose different objects, attributes, actions, and motions into a video remains underexplored. Previous text-to-video benchmarks also neglect this important ability in their evaluations. In this work, we conduct the first systematic study on compositional text-to-video generation. We propose T2V-CompBench, the first benchmark tailored for compositional text-to-video generation. T2V-CompBench encompasses diverse aspects of compositionality, including consistent attribute binding, dynamic attribute binding, spatial relationships, motion binding, action binding, object interactions, and generative numeracy. We further carefully design a suite of evaluation metrics, including multimodal large language model (MLLM)-based, detection-based, and tracking-based metrics, which better reflect the compositional text-to-video generation quality across the seven proposed categories with 1,400 text prompts. The effectiveness of the proposed metrics is verified by their correlation with human evaluations. We also benchmark various text-to-video generative models and conduct in-depth analysis across different models and compositional categories. We find that compositional text-to-video generation is highly challenging for current models, and we hope our attempt can shed light on future research in this direction.
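The multi-granularity evaluation described above can be pictured as a dispatch from each compositional category to one metric family. The following is a minimal sketch, not the official T2V-CompBench code: the category names come from the abstract, the category-to-metric assignments are assumptions, and the scoring functions are placeholders.

```python
from typing import Callable, Dict, List

def mllm_score(prompt: str, frames: List) -> float:
    # Placeholder: a real metric would query a multimodal LLM about the frames.
    return 0.0

def detection_score(prompt: str, frames: List) -> float:
    # Placeholder: a real metric would run an object detector on each frame.
    return 0.0

def tracking_score(prompt: str, frames: List) -> float:
    # Placeholder: a real metric would track objects across frames over time.
    return 0.0

# Assumed grouping of the seven categories into the three metric families
# named in the abstract (MLLM-based, detection-based, tracking-based).
METRIC_FOR_CATEGORY: Dict[str, Callable[[str, List], float]] = {
    "consistent attribute binding": mllm_score,
    "dynamic attribute binding": mllm_score,
    "spatial relationships": detection_score,
    "motion binding": tracking_score,
    "action binding": mllm_score,
    "object interactions": mllm_score,
    "generative numeracy": detection_score,
}

def evaluate(category: str, prompt: str, frames: List) -> float:
    """Score one generated video on its prompt's compositional category."""
    return METRIC_FOR_CATEGORY[category](prompt, frames)
```

In this sketch, adding a new compositional category only requires registering it in the dispatch table with the appropriate metric family.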
Problem

Research questions and friction points this paper is trying to address.

Text-to-Video Generation
Compositional Complexity
Model Limitations
Innovation

Methods, ideas, or system contributions that make the work stand out.

T2V-CompBench
compositional quality evaluation
analysis of video generation models