DiffVQA: Video Quality Assessment Using Diffusion Feature Extractor

📅 2025-05-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing video quality assessment (VQA) methods struggle to align with human perception in real-world scenarios, primarily due to limited representational capacity and insufficient diversity in training data. To address this, we propose DiffVQA—a novel framework that (1) pioneers the use of pretrained diffusion models as general-purpose semantic and distortion feature extractors for VQA; (2) introduces a dual-branch reconstruction control module (employing resize and crop operations) to enhance robustness against spatial distortions; and (3) incorporates a parallel Mamba architecture to efficiently model long-range temporal consistency. DiffVQA fuses multi-scale features and performs end-to-end quality regression. It achieves state-of-the-art in-domain performance across multiple benchmarks and significantly improves cross-dataset generalization, outperforming mainstream CNN- and ViT-based baselines. Our core contributions include: the first adaptation of diffusion models to VQA; controllable, dual-path feature disentanglement; and Mamba-driven, computationally efficient long-term temporal modeling.
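The summary above describes a fuse-then-regress pipeline: per-frame diffusion features and Mamba temporal features are merged and mapped to one quality score. The paper does not give code, so the sketch below is a minimal, hypothetical NumPy mock-up of that data flow; `diffusion_features`, `mamba_temporal_features`, the feature dimensions (512/128), and the linear head are all illustrative stand-ins, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def diffusion_features(frames):
    """Stand-in for the pretrained diffusion feature extractor:
    one 512-d semantic/distortion embedding per frame (hypothetical dim)."""
    return rng.standard_normal((len(frames), 512))

def mamba_temporal_features(frames):
    """Stand-in for the parallel Mamba branch: one 128-d
    temporal-coherence embedding per frame (hypothetical dim)."""
    return rng.standard_normal((len(frames), 128))

def predict_quality(frames, w, b):
    """Fuse per-frame features from both branches, pool over time,
    and regress a scalar quality score."""
    fused = np.concatenate(
        [diffusion_features(frames), mamba_temporal_features(frames)],
        axis=1,                      # shape (T, 512 + 128)
    )
    clip_feat = fused.mean(axis=0)   # temporal average pooling -> (640,)
    return float(clip_feat @ w + b)  # linear quality head

frames = [None] * 16                 # placeholder for a 16-frame clip
w = rng.standard_normal(640)
score = predict_quality(frames, w, b=3.0)
```

In the actual model the fusion and regression head are learned end-to-end; the point here is only the shape of the computation, not its weights.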

📝 Abstract
Video Quality Assessment (VQA) aims to evaluate video quality based on perceptual distortions and human preferences. Despite the promising performance of existing methods using Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), they often struggle to align closely with human perceptions, particularly in diverse real-world scenarios. This challenge is exacerbated by the limited scale and diversity of available datasets. To address this limitation, we introduce a novel VQA framework, DiffVQA, which harnesses the robust generalization capabilities of diffusion models pre-trained on extensive datasets. Our framework adapts these models to reconstruct identical input frames through a control module. The adapted diffusion model is then used to extract semantic and distortion features from a resizing branch and a cropping branch, respectively. To enhance the model's ability to handle long-term temporal dynamics, a parallel Mamba module is introduced, which extracts temporal coherence augmented features that are merged with the diffusion features to predict the final score. Experiments across multiple datasets demonstrate DiffVQA's superior performance on intra-dataset evaluations and its exceptional generalization across datasets. These results confirm that leveraging a diffusion model as a feature extractor can offer enhanced VQA performance compared to CNN and ViT backbones.
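The abstract's two input branches can be illustrated with plain array operations: resizing keeps the whole frame (global semantics) at reduced resolution, while cropping keeps a native-resolution patch (local distortion detail). This is a hedged sketch of that preprocessing only, assuming 224×224 inputs and nearest-neighbour resizing; the paper's exact sizes and interpolation are not specified here.

```python
import numpy as np

def resize_branch(frame, size=224):
    """Nearest-neighbour downscale of the full frame: preserves global
    layout, feeding the semantic branch."""
    h, w, _ = frame.shape
    ys = np.arange(size) * h // size   # source row indices
    xs = np.arange(size) * w // size   # source column indices
    return frame[ys][:, xs]

def crop_branch(frame, size=224):
    """Centre crop at native resolution: preserves local pixel
    statistics, feeding the distortion branch."""
    h, w, _ = frame.shape
    top, left = (h - size) // 2, (w - size) // 2
    return frame[top:top + size, left:left + size]

frame = np.zeros((720, 1280, 3), dtype=np.uint8)  # one 720p frame
resized, cropped = resize_branch(frame), crop_branch(frame)
```

Both branches emit same-sized tensors, so the adapted diffusion model can process them with shared machinery before their features are fused.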
Problem

Research questions and friction points this paper is trying to address.

Improving alignment of video quality assessment with human perception
Addressing limited dataset scale and diversity in VQA
Improving handling of long-term temporal dynamics in video quality evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses diffusion models for feature extraction
Integrates Mamba module for temporal dynamics
Combines resizing and cropping branch features