Q-CLIP: Unleashing the Power of Vision-Language Models for Video Quality Assessment through Unified Cross-Modal Adaptation

📅 2025-08-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing video quality assessment (VQA) methods suffer from insufficient semantic transfer and prohibitively high pretraining computational overhead. To address these challenges, this paper proposes the first vision-language model (VLM)-driven VQA framework, featuring a shared cross-modal adapter, five learnable quality-level prompts, and a frame-difference-based sampling strategy, enabling lightweight and efficient quality representation learning. With fewer than 0.1% of its parameters trainable, the model accurately captures fine-grained quality distortions. Extensive experiments on multiple mainstream VQA benchmarks demonstrate that the method significantly outperforms both supervised and unsupervised state-of-the-art approaches, reducing training cost by over 90% while substantially improving cross-dataset generalization. This work provides the first empirical validation of VLMs' effectiveness and practicality for end-to-end VQA.

📝 Abstract
Accurate and efficient Video Quality Assessment (VQA) has long been a key research challenge. Current mainstream VQA methods typically improve performance by pretraining on large-scale classification datasets (e.g., ImageNet, Kinetics-400), followed by fine-tuning on VQA datasets. However, this strategy presents two significant challenges: (1) merely transferring semantic knowledge learned from pretraining is insufficient for VQA, as video quality depends on multiple factors (e.g., semantics, distortion, motion, aesthetics); (2) pretraining on large-scale datasets demands enormous computational resources, often dozens or even hundreds of times greater than training directly on VQA datasets. Recently, Vision-Language Models (VLMs) have shown remarkable generalization capabilities across a wide range of visual tasks, and have begun to demonstrate promising potential in quality assessment. In this work, we propose Q-CLIP, the first fully VLM-based framework for VQA. Q-CLIP enhances both visual and textual representations through a Shared Cross-Modal Adapter (SCMA), which contains only a minimal number of trainable parameters and is the only component that requires training. This design significantly reduces computational cost. In addition, we introduce a set of five learnable quality-level prompts to guide the VLM in perceiving subtle quality variations, thereby further enhancing the model's sensitivity to video quality. Furthermore, we investigate the impact of different frame sampling strategies on VQA performance, and find that frame-difference-based sampling leads to better generalization across datasets. Extensive experiments demonstrate that Q-CLIP exhibits excellent performance on several VQA datasets.
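To make the abstract's description concrete, here is a minimal, hypothetical PyTorch sketch of the two trainable pieces it names: a shared bottleneck adapter applied to both frozen visual and textual features, and five learnable quality-level prompts whose similarities to the video feature are turned into an expected quality score. Module names, dimensions, and the scoring rule are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch, assuming a frozen CLIP-style backbone provides 512-d features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedCrossModalAdapter(nn.Module):
    """Residual bottleneck adapter shared by the visual and textual branches (assumed design)."""
    def __init__(self, dim: int = 512, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Only these few parameters would be trained; the backbone stays frozen.
        return x + self.up(F.relu(self.down(x)))

class QualityHead(nn.Module):
    """Scores a video against five learnable quality-level prompts."""
    def __init__(self, dim: int = 512, num_levels: int = 5):
        super().__init__()
        self.adapter = SharedCrossModalAdapter(dim)
        # Learnable embeddings standing in for prompts like "bad" ... "excellent".
        self.level_prompts = nn.Parameter(0.02 * torch.randn(num_levels, dim))
        # Map the five levels onto a 1-5 quality scale.
        self.register_buffer("level_values", torch.linspace(1.0, 5.0, num_levels))

    def forward(self, frame_features: torch.Tensor) -> torch.Tensor:
        # frame_features: (num_frames, dim) from a frozen vision encoder.
        video_feat = self.adapter(frame_features).mean(dim=0)   # pooled video feature
        text_feat = self.adapter(self.level_prompts)            # same adapter on the text side
        sims = F.cosine_similarity(video_feat.unsqueeze(0), text_feat, dim=-1)
        probs = sims.softmax(dim=-1)
        return (probs * self.level_values).sum()                # expected quality score

# Toy usage with random features standing in for frozen CLIP embeddings.
head = QualityHead()
print(float(head(torch.randn(8, 512))))
```

The expectation-over-levels scoring shown here is a common way to turn level prompts into a continuous score; whether Q-CLIP uses exactly this rule is not stated in the abstract.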
Problem

Research questions and friction points this paper is trying to address.

Improving video quality assessment using vision-language models
Reducing computational cost in VQA model training
Enhancing sensitivity to subtle video quality variations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Shared Cross-Modal Adapter enhances representations
Learnable quality prompts improve sensitivity
Frame-difference sampling boosts generalization (see the sketch after this list)
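The abstract only states that frame-difference-based sampling generalizes better and gives no procedure, so the selection rule below (keep the first frame plus the frames that follow the largest inter-frame changes) is purely an illustrative assumption, not the authors' method.

```python
# Hedged sketch of a frame-difference-based sampling strategy.
import numpy as np

def frame_difference_sampling(frames: np.ndarray, num_samples: int = 8) -> np.ndarray:
    """frames: (T, H, W, C) uint8 clip; returns sorted indices of the selected frames."""
    diffs = np.abs(frames[1:].astype(np.float32) - frames[:-1].astype(np.float32))
    change = diffs.mean(axis=(1, 2, 3))                 # mean absolute change per transition
    # Keep the first frame plus the frames following the largest changes (assumption).
    top = 1 + np.argsort(change)[::-1][: num_samples - 1]
    return np.sort(np.concatenate(([0], top)))

# Toy usage on a random clip.
clip = (np.random.rand(30, 64, 64, 3) * 255).astype(np.uint8)
print(frame_difference_sampling(clip, num_samples=8))
```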
🔎 Similar Papers
No similar papers found.
Yachun Mi
Harbin Institute of Technology
video quality assessment, video value analysis, multimodal fusion
Yu Li
Harbin Institute of Technology
Yanting Li
Harbin Institute of Technology
Shixin Sun
Harbin Institute of Technology
Chen Hui
Harbin Institute of Technology & Nanyang Technological University
image compression, quality assessment, multimedia security, image and video processing
Tong Zhang
Harbin Institute of Technology
Yuanyuan Liu
China University of Geosciences (Wuhan)
Chenyue Song
Harbin Institute of Technology
Shaohui Liu
Harbin Institute of Technology