AI Summary
This work addresses the lack of reliable and scalable meta-evaluation benchmarks for text-to-video (T2V) generation of long videos, which hinders validation of existing metrics' ranking accuracy in scenarios that humans can judge easily. The authors propose SLVMEval, the first synthetic meta-evaluation benchmark tailored for videos up to three hours long. It constructs high-quality versus low-quality video pairs by controllably degrading source videos along ten distinct dimensions, leveraging a densely annotated description dataset and a human-perception filtering mechanism. Experimental results under a pairwise comparison framework show that humans achieve 84.7%-96.8% accuracy in identifying the superior video, whereas current evaluation systems perform significantly worse than humans on nine of the ten dimensions, revealing their limited reliability in long-video assessment.
Abstract
This paper proposes the synthetic long-video meta-evaluation (SLVMEval) benchmark for meta-evaluating text-to-video (T2V) evaluation systems. SLVMEval assesses these systems on videos of up to 10,486 s (approximately 3 h). The benchmark targets a fundamental requirement, namely, whether the systems can accurately assess video quality in settings that are easy for humans to judge. We adopt a pairwise comparison-based meta-evaluation framework. Building on dense video-captioning datasets, we synthetically degrade source videos to create controlled "high-quality versus low-quality" pairs across ten distinct aspects. We then employ crowdsourcing to filter and retain only those pairs in which the degradation is clearly perceptible, thereby establishing an effective final testbed. Using this testbed, we assess the reliability of existing evaluation systems in ranking these pairs. Experimental results demonstrate that human evaluators can identify the better long video with 84.7%-96.8% accuracy, and in nine of the ten aspects, the accuracy of these systems falls short of human assessment, revealing weaknesses in text-to-long-video evaluation.
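To make the pairwise meta-evaluation protocol concrete, the sketch below shows how an evaluation system's ranking accuracy over high-quality/degraded pairs could be computed for one aspect. This is a minimal illustration, assuming hypothetical names (`pairs`, `score_video`); the abstract does not specify SLVMEval's actual data format or API.

```python
# Minimal sketch of pairwise ranking accuracy for a T2V evaluation system.
# Names and data layout are hypothetical, not SLVMEval's released interface.
from typing import Callable, List, Tuple


def pairwise_accuracy(
    pairs: List[Tuple[str, str]],          # (high_quality_video, degraded_video) per pair
    score_video: Callable[[str], float],   # system under test: video -> quality score
) -> float:
    """Fraction of pairs where the high-quality video is scored above its degraded counterpart."""
    correct = sum(1 for high, low in pairs if score_video(high) > score_video(low))
    return correct / len(pairs)


# Usage (illustrative): a system is considered reliable on an aspect when its
# pairwise accuracy approaches the reported human accuracy of 84.7%-96.8%.
# accuracy = pairwise_accuracy(temporal_consistency_pairs, my_metric.score)
```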