🤖 AI Summary
Existing evaluation metrics for text-to-video (T2V) generation, such as CLIPScore, provide only coarse-grained semantic alignment scores, failing to capture fine-grained spatiotemporal consistency and deviating substantially from human preferences.
Method: We propose the first fine-grained, question-answering–based T2V alignment evaluation framework. It employs multi-agent parsing of text prompts, constructs atomic questions guided by semantic scene graphs, and leverages a knowledge-augmented video large language model (Video-LLM) for multi-stage commonsense reasoning.
Contribution/Results: We introduce the first large-scale T2V alignment benchmark (2K prompts, 12K questions), enabling interpretable, fine-grained assessment. Experiments show our metric achieves a Spearman correlation of 58.47 with human preferences, significantly outperforming CLIPScore (31.0). A comprehensive evaluation of 15 state-of-the-art T2V models reveals their capability boundaries and shared limitations.
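The scoring idea behind such QA-based evaluation can be sketched in a few lines: decompose a prompt into atomic yes/no questions, have a video QA model answer each, and report the fraction answered "yes". The decomposition and QA calls below are hypothetical stand-ins (the paper uses a multi-agent scene-graph parser and a knowledge-augmented Video-LLM), so only the aggregation logic is shown concretely.

```python
def alignment_score(answers: list[bool]) -> float:
    """Toy aggregation for a QA-based alignment metric: the score for one
    (prompt, video) pair is the fraction of atomic questions answered 'yes'.
    In the actual framework, `answers` would come from a Video-LLM answering
    questions generated from the prompt's semantic scene graph."""
    if not answers:
        return 0.0
    return sum(answers) / len(answers)

# Example: a prompt decomposed into 4 atomic questions, 3 answered "yes"
# (e.g., "Is there a cat?", "Is the cat orange?", "Is it jumping?", ...).
score = alignment_score([True, True, False, True])
print(score)  # 0.75
```

Because each question probes one entity, attribute, or relation, the per-question answers themselves serve as the interpretable, fine-grained diagnosis that a single coarse score like CLIPScore cannot provide.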
📝 Abstract
Precisely evaluating semantic alignment between text prompts and generated videos remains a challenge in Text-to-Video (T2V) generation. Existing text-to-video alignment metrics such as CLIPScore only generate coarse-grained scores without fine-grained alignment details, failing to align with human preferences. To address this limitation, we propose ETVA, a novel Evaluation method of Text-to-Video Alignment via fine-grained question generation and answering. First, a multi-agent system parses prompts into semantic scene graphs to generate atomic questions. Then we design a knowledge-augmented multi-stage reasoning framework for question answering, where an auxiliary LLM first retrieves relevant commonsense knowledge (e.g., physical laws), and a video LLM then answers the generated questions through a multi-stage reasoning mechanism. Extensive experiments demonstrate that ETVA achieves a Spearman's correlation coefficient of 58.47, showing a much higher correlation with human judgment than existing metrics, which attain only 31.0. We also construct a comprehensive benchmark specifically designed for text-to-video alignment evaluation, featuring 2K diverse prompts and 12K atomic questions spanning 10 categories. Through a systematic evaluation of 15 existing text-to-video models, we identify their key capabilities and limitations, paving the way for next-generation T2V generation.