VideoScore2: Think before You Score in Generative Video Evaluation

📅 2025-09-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing video evaluation methods produce only a single, uninterpretable scalar score, failing to characterize multi-dimensional quality attributes such as visual fidelity, semantic alignment, and physical plausibility. Method: We propose the first interpretable, multi-dimensional video assessment framework aligned with human preferences, built on a three-axis evaluation taxonomy covering visual quality, text–video alignment, and physical commonsense consistency. Leveraging VideoFeedback2, a newly curated dataset of human-annotated reasoning trajectories, we adopt a two-stage training paradigm of supervised fine-tuning followed by Group Relative Policy Optimization (GRPO), enabling chain-of-thought generation of fine-grained feedback comments and multi-dimensional scores. Contribution/Results: Our method achieves 44.35 accuracy (+5.94) on VideoScore-Bench-v2 and an average score of 50.37 (+4.32) across four external benchmarks, significantly outperforming state-of-the-art methods, and supports Best-of-N generation optimization.
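The group-relative credit assignment at the heart of GRPO can be sketched as follows. This is a hedged illustration, not the paper's implementation: each sampled response in a group is scored by the reward model, and its advantage is the reward z-scored within the group.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its group's mean and std.

    `rewards` is the list of scalar rewards for one prompt's group of
    sampled responses; names and shapes here are illustrative assumptions.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Rollouts above the group mean get positive advantage, below get negative.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four rollouts for one prompt, scored by a reward model.
advs = group_relative_advantages([0.2, 0.8, 0.5, 0.5])
```

Because advantages are centered within each group, they sum to (approximately) zero, so the policy is pushed toward the better responses of each group rather than toward any absolute reward scale.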

📝 Abstract
Recent advances in text-to-video generation have produced increasingly realistic and diverse content, yet evaluating such videos remains a fundamental challenge due to their multi-faceted nature encompassing visual quality, semantic alignment, and physical consistency. Existing evaluators and reward models are limited to single opaque scores, lack interpretability, or provide only coarse analysis, making them insufficient for capturing the comprehensive nature of video quality assessment. We present VideoScore2, a multi-dimensional, interpretable, and human-aligned framework that explicitly evaluates visual quality, text-to-video alignment, and physical/common-sense consistency while producing detailed chain-of-thought rationales. Our model is trained on a large-scale dataset VideoFeedback2 containing 27,168 human-annotated videos with both scores and reasoning traces across three dimensions, using a two-stage pipeline of supervised fine-tuning followed by reinforcement learning with Group Relative Policy Optimization (GRPO) to enhance analytical robustness. Extensive experiments demonstrate that VideoScore2 achieves superior performance with 44.35 (+5.94) accuracy on our in-domain benchmark VideoScore-Bench-v2 and 50.37 (+4.32) average performance across four out-of-domain benchmarks (VideoGenReward-Bench, VideoPhy2, etc), while providing interpretable assessments that bridge the gap between evaluation and controllable generation through effective reward modeling for Best-of-N sampling. Project Page: https://tiger-ai-lab.github.io/VideoScore2/
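The Best-of-N sampling use mentioned in the abstract can be sketched as below. The `generate` and `score` callables are hypothetical stand-ins for a video generator and a VideoScore2-style scalar scorer; this is an assumption-laden sketch of the selection loop, not the authors' code.

```python
def best_of_n(prompt, generate, score, n=4):
    """Generate n candidate videos for a prompt and keep the highest-scoring one.

    `generate(prompt)` returns one candidate video; `score(video)` returns a
    scalar reward. Both are illustrative placeholders.
    """
    candidates = [generate(prompt) for _ in range(n)]
    scores = [score(c) for c in candidates]
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best], scores[best]
```

In practice a multi-dimensional evaluator would first be reduced to a scalar (for example, an average of the three dimension scores) before being used as the `score` function here.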
Problem

Research questions and friction points this paper is trying to address.

Evaluating multi-faceted quality of generated videos
Providing interpretable assessments beyond single scores
Addressing visual quality, semantic alignment, and physical consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-dimensional interpretable framework for video evaluation
Two-stage training with supervised fine-tuning and GRPO
Human-aligned assessments with chain-of-thought rationales