🤖 AI Summary
This work addresses perceptual quality assessment for user-generated short videos, which is complicated by complex generation pipelines, rapid content variation, and mixed distortions. The authors propose an end-to-end video quality assessment (VQA) framework that uses compression priors derived from the frequency domain to generate artifact- and structure-aware weight maps for feature aggregation. Built on a dense CLIP-based visual encoder, the method separates artifact-aware, structure-aware, and raw visual feature branches and fuses them adaptively over time through a learned gating module. On short-form video datasets it reports an SRCC of 0.736 and a PLCC of 0.787 while keeping inference efficient.
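To make the aggregation and gating idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names (`weighted_pool`, `gated_fusion`), the softmax gate over concatenated branch features, and the single linear gate layer are all assumptions chosen for illustration. The weight maps are taken as given inputs, since the summary does not specify how the frequency-domain priors are computed.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_pool(tokens, weights):
    """Pool patch tokens into one vector using a per-patch weight map.

    tokens:  list of N patch features, each a list of D floats
    weights: list of N non-negative floats with a positive sum
             (hypothetical artifact- or structure-aware weights)
    """
    total = sum(weights)
    dim = len(tokens[0])
    return [
        sum(w * tok[d] for w, tok in zip(weights, tokens)) / total
        for d in range(dim)
    ]


def gated_fusion(branches, gate_w, gate_b):
    """Fuse branch features with a learned softmax gate (assumed form).

    branches: list of B feature vectors, each of length D
    gate_w:   B rows of length B * D (hypothetical linear gate weights)
    gate_b:   B bias terms
    Returns the fused D-dim vector and the B gate values.
    """
    concat = [v for branch in branches for v in branch]
    logits = [
        sum(w * v for w, v in zip(row, concat)) + b
        for row, b in zip(gate_w, gate_b)
    ]
    gates = softmax(logits)
    dim = len(branches[0])
    fused = [
        sum(g * branch[d] for g, branch in zip(gates, branches))
        for d in range(dim)
    ]
    return fused, gates
```

In this sketch, each frame would yield three pooled vectors (artifact-weighted, structure-weighted, and uniformly pooled raw features), and `gated_fusion` would be applied per frame so that the mixing weights can change over time.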
📝 Abstract
Short-form video poses new challenges to the quality assessment of user-generated content (UGC) due to its complex generation pipeline, rapid content variation, and mixed distortions. To address this challenge, we propose an end-to-end video quality assessment (VQA) framework that employs a dense visual encoder based on CLIP, and incorporates compression priors derived from the frequency domain to generate artifact- and structure-aware weight maps for feature aggregation. By explicitly decomposing artifact, structure, and original visual feature branches and adaptively fusing them over time through a learned gating module, the proposed method achieves accurate and efficient quality prediction. Experimental results show that our method achieves strong performance on short-form video datasets in terms of average rank and linear correlation (SRCC: 0.736, PLCC: 0.787), while maintaining efficient inference runtime. The code and additional results are available at: https://github.com/xinyiW915/FGSVQA.
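For readers unfamiliar with the two reported metrics, the following sketch shows how SRCC (Spearman rank correlation) and PLCC (Pearson linear correlation) can be computed between predicted scores and subjective ratings. The helper names are illustrative, and the sketch omits the nonlinear logistic mapping that VQA evaluations commonly apply before PLCC; whether the authors apply one is not stated in the abstract.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC) of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical predicted scores and subjective ratings, for illustration only.
predicted = [1.0, 2.0, 3.0, 4.0]
ratings = [1.0, 4.0, 9.0, 16.0]
srcc = spearman(predicted, ratings)  # monotonic relation, so the ranks agree
plcc = pearson(predicted, ratings)   # relation is not linear, so below 1
```

SRCC measures only whether the predicted ordering matches the subjective ordering, while PLCC also penalizes departures from a linear relationship, which is why the two values reported above differ.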