🤖 AI Summary
Existing evaluation methods for AI-generated videos (AIGVs) lack holistic, interpretable, and quantitative quality assessment. Method: We construct a benchmark dataset of nearly 10,000 samples and propose the first unified framework jointly modeling scalar quality scores and natural-language attribution explanations. Our approach integrates SlowFast-based multi-scale video encoding with three-dimensional fine-grained annotations—visual fidelity, motion realism, and text–video alignment—and employs a multi-stage training strategy: chain-of-thought–guided supervised fine-tuning (SFT), grouped relative policy optimization (GRPO) reinforcement learning, and iterative SFT refinement. Contribution/Results: The resulting model achieves state-of-the-art performance in quality prediction while generating human-interpretable, natural-language justifications. It significantly improves stability and alignment with human preferences, offering a transparent, trustworthy paradigm for multimodal generative evaluation.
📝 Abstract
We present Q-Save, a new benchmark dataset and model for holistic and explainable evaluation of AI-generated video (AIGV) quality. The dataset contains nearly 10,000 videos, each annotated with a scalar mean opinion score (MOS) and fine-grained attribution labels along three core dimensions: visual quality, dynamic quality, and text-video alignment. These multi-aspect annotations enable both accurate quality assessment and interpretable reasoning behind the scores. To leverage this data, we propose a unified evaluation model that jointly performs quality scoring and attribution-based explanation. The model adopts the SlowFast framework to distinguish between slow frames and fast frames: slow frames are processed at high resolution while fast frames use low resolution, balancing evaluation accuracy and computational efficiency. For training, we format the data in Chain-of-Thought (CoT) style and employ a multi-stage strategy: we first conduct Supervised Fine-Tuning (SFT), then further enhance the model with Grouped Relative Policy Optimization (GRPO), and finally perform SFT again to improve model stability. Experimental results demonstrate that our model achieves state-of-the-art performance in video quality prediction while also providing human-aligned, interpretable justifications. Our dataset and model establish a strong foundation for explainable evaluation in generative video research, contributing to the development of multimodal generation and trustworthy AI. Code and dataset will be released upon publication.
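The two mechanisms named above can be sketched in a few lines. The snippet below is a minimal, illustrative Python sketch, not the paper's actual implementation: the strides, resolutions, and function names are assumptions chosen for clarity. Part (1) shows a SlowFast-style frame schedule where sparsely sampled slow frames keep high resolution while densely sampled fast frames are downscaled; part (2) shows the group-relative advantage at the heart of GRPO, which normalizes each sampled response's reward against the mean and standard deviation of its group.

```python
# Illustrative sketch only: slow_stride, fast_stride, and the 448/224
# resolutions are assumed values, not those used by Q-Save.

def schedule_frames(num_frames, slow_stride=8, fast_stride=2,
                    slow_res=448, fast_res=224):
    """Return (frame_index, resolution) pairs for one video.

    Every fast_stride-th frame enters the fast pathway at low resolution;
    the sparser slow_stride-th frames are promoted to high resolution.
    """
    schedule = []
    for i in range(0, num_frames, fast_stride):
        if i % slow_stride == 0:
            schedule.append((i, slow_res))   # slow pathway: high resolution
        else:
            schedule.append((i, fast_res))   # fast pathway: low resolution
    return schedule


def group_relative_advantages(rewards):
    """GRPO-style advantage: z-score each reward within its sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = max(var ** 0.5, 1e-8)              # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]


if __name__ == "__main__":
    plan = schedule_frames(32)
    slow = [i for i, r in plan if r == 448]
    print(f"{len(plan)} frames scheduled; slow frames at indices {slow}")
    print(group_relative_advantages([1.0, 2.0, 3.0]))
```

Normalizing within the group (rather than against a learned value baseline) is what lets GRPO drop the critic network; responses to the same prompt compete only with each other.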