🤖 AI Summary
Existing AI-generated video quality assessment methods exhibit significant limitations in spatial fidelity, temporal coherence, and text-video alignment, particularly in detecting anomalous motion and semantically implausible content. To address this, we introduce LGVQ, the first multi-dimensional subjective benchmark specifically designed for AIGC videos, covering spatial quality, temporal quality, and text-video alignment. We further propose UGVQ, the first unified multimodal evaluation model for AIGC videos, which integrates visual appearance, motion dynamics, and CLIP-style text-video alignment features to better capture typical AIGC distortions. Leveraging large-scale human annotations and spatiotemporal modeling, UGVQ achieves state-of-the-art performance across all three dimensions on LGVQ. Both the LGVQ dataset and the UGVQ model are publicly released, establishing the first comprehensive, open-source toolkit for AIGC video quality assessment.
📝 Abstract
In recent years, artificial intelligence (AI)-driven video generation has gained significant attention. Consequently, there is a growing need for accurate video quality assessment (VQA) metrics to evaluate the perceptual quality of AI-generated content (AIGC) videos and to optimize video generation models. However, assessing the quality of AIGC videos remains a significant challenge because these videos often exhibit highly complex distortions, such as unnatural actions and irrational objects. To address this challenge, we systematically investigate the AIGC-VQA problem from both subjective and objective quality assessment perspectives. For the subjective perspective, we construct the Large-scale Generated Video Quality assessment (LGVQ) dataset, consisting of 2,808 AIGC videos generated by 6 video generation models using 468 carefully curated text prompts. We evaluate the perceptual quality of AIGC videos along three critical dimensions: spatial quality, temporal quality, and text-video alignment. For the objective perspective, we establish a benchmark for evaluating existing quality assessment metrics on the LGVQ dataset. Our findings show that current metrics perform poorly on this dataset, highlighting a gap in effective evaluation tools. To bridge this gap, we propose the Unify Generated Video Quality assessment (UGVQ) model, designed to accurately evaluate the multi-dimensional quality of AIGC videos. The UGVQ model integrates the visual and motion features of videos with the textual features of their corresponding prompts, forming a unified quality-aware feature representation tailored to AIGC videos. Experimental results demonstrate that UGVQ achieves state-of-the-art performance on the LGVQ dataset across all three quality dimensions. Both the LGVQ dataset and the UGVQ model are publicly available at https://github.com/zczhang-sjtu/UGVQ.git.
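The multimodal fusion described above (visual appearance + motion dynamics + prompt text, projected to per-dimension quality scores) can be sketched in a few lines. This is a hedged illustration, not the paper's implementation: the feature dimensions, the single concatenate-and-project fusion layer, and the randomly initialized weights are all assumptions standing in for UGVQ's learned backbone and regression heads.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature dimensions (illustrative only, not from the paper).
D_VIS, D_MOT, D_TXT, D_FUSE = 512, 256, 512, 128

def fuse_and_score(visual, motion, text, w_fuse, w_head):
    """Concatenate per-modality features into one quality-aware
    representation, then regress three scores: spatial quality,
    temporal quality, and text-video alignment."""
    fused = np.concatenate([visual, motion, text])   # unified feature vector
    hidden = np.maximum(fused @ w_fuse, 0.0)         # simple ReLU projection
    return hidden @ w_head                           # shape (3,): one score per dimension

# Random weights stand in for parameters learned from human annotations.
w_fuse = rng.standard_normal((D_VIS + D_MOT + D_TXT, D_FUSE)) * 0.01
w_head = rng.standard_normal((D_FUSE, 3)) * 0.01

scores = fuse_and_score(
    rng.standard_normal(D_VIS),  # e.g. frame-level appearance features
    rng.standard_normal(D_MOT),  # e.g. motion/temporal features
    rng.standard_normal(D_TXT),  # e.g. CLIP-style text embedding of the prompt
    w_fuse, w_head,
)
print(scores.shape)  # (3,)
```

In practice the three scores would each be trained against the corresponding human mean opinion scores from the LGVQ annotations; the sketch only shows the shape of the unified representation and the three-dimensional output.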