Are Synthetic Videos Useful? A Benchmark for Retrieval-Centric Evaluation of Synthetic Videos

📅 2025-07-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing text-to-video (T2V) evaluation metrics emphasize visual quality and temporal coherence but neglect practical utility in downstream tasks, particularly text-to-video retrieval (TVR). Method: We introduce SynTVA, the first retrieval-oriented synthetic video benchmark, featuring human annotations along four semantic alignment dimensions (Object & Scene, Action, Attribute, and Prompt Fidelity), and propose an extensible Auto-Evaluator that predicts alignment quality automatically. Contribution/Results: Experiments reveal only weak correlation between conventional video quality metrics and TVR performance; conversely, videos with high semantic alignment significantly improve retrieval accuracy. By curating training sets from high-utility synthetic videos, we empirically validate their effectiveness in enhancing TVR models and establish a scalable, task-aligned evaluation paradigm for synthetic video generation.

📝 Abstract
Text-to-video (T2V) synthesis has advanced rapidly, yet current evaluation metrics primarily capture visual quality and temporal consistency, offering limited insight into how synthetic videos perform in downstream tasks such as text-to-video retrieval (TVR). In this work, we introduce SynTVA, a new dataset and benchmark designed to evaluate the utility of synthetic videos for building retrieval models. Based on 800 diverse user queries derived from the MSRVTT training split, we generate synthetic videos using state-of-the-art T2V models and annotate each video-text pair along four key semantic alignment dimensions: Object & Scene, Action, Attribute, and Prompt Fidelity. Our evaluation framework correlates general video quality assessment (VQA) metrics with these alignment scores and examines their predictive power for downstream TVR performance. To explore pathways for scaling up, we further develop an Auto-Evaluator to estimate alignment quality from existing metrics. Beyond benchmarking, our results show that SynTVA is a valuable asset for dataset augmentation, enabling the selection of high-utility synthetic samples that measurably improve TVR outcomes. The project page and dataset can be found at https://jasoncodemaker.github.io/SynTVA/.
Problem

Research questions and friction points this paper is trying to address.

Evaluates synthetic videos' utility in retrieval tasks
Measures semantic alignment in video-text pairs
Develops Auto-Evaluator for scaling alignment assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces SynTVA dataset for synthetic video evaluation
Uses Auto-Evaluator to estimate semantic alignment quality
Enhances retrieval models with high-utility synthetic samples
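The selection step in the last bullet can be pictured as a simple filter over annotated alignment scores. The sketch below is a minimal, hypothetical illustration, not the paper's implementation: the four dimension names follow SynTVA, but the score scale (1-5), the averaging rule, and the threshold value are assumptions made here for clarity.

```python
# Hypothetical sketch of SynTVA-style high-utility sample selection.
# Dimension names follow the paper; the 1-5 scale and the 4.0 cutoff are assumed.

DIMENSIONS = ("object_scene", "action", "attribute", "prompt_fidelity")

def utility_score(alignment: dict) -> float:
    """Average the four semantic-alignment scores of one video-text pair."""
    return sum(alignment[d] for d in DIMENSIONS) / len(DIMENSIONS)

def select_high_utility(pairs: list, threshold: float = 4.0) -> list:
    """Keep synthetic video-text pairs whose mean alignment meets the threshold."""
    return [p for p in pairs if utility_score(p["alignment"]) >= threshold]

# Example: two synthetic pairs with (assumed) human alignment annotations.
pairs = [
    {"video": "syn_001.mp4",
     "alignment": {"object_scene": 5, "action": 4, "attribute": 4, "prompt_fidelity": 5}},
    {"video": "syn_002.mp4",
     "alignment": {"object_scene": 2, "action": 3, "attribute": 2, "prompt_fidelity": 2}},
]
kept = select_high_utility(pairs)  # only syn_001 (mean 4.5) passes the cutoff
```

In the paper's pipeline, the scores would come either from human annotation or from the Auto-Evaluator's predictions, and the retained pairs would be added to the TVR training set.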