Q-CLIP: Unleashing the Power of Vision-Language Models for Video Quality Assessment through Unified Cross-Modal Adaptation

📅 2025-08-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing video quality assessment (VQA) methods suffer from insufficient semantic transfer and prohibitively high pretraining computational overhead. To address these challenges, this paper proposes the first vision-language model (VLM)-driven VQA framework, featuring a shared cross-modal adapter, five learnable quality-level prompts, and a frame-difference-based sampling strategy, enabling lightweight and efficient quality representation learning. With fewer than 0.1% of its parameters trainable, the model accurately captures fine-grained quality distortions. Extensive experiments on multiple mainstream VQA benchmarks demonstrate that the method significantly outperforms both supervised and unsupervised state-of-the-art approaches, reducing training cost by over 90% while substantially improving cross-dataset generalization. This work provides the first empirical validation of VLMs' effectiveness and practicality for end-to-end VQA.

📝 Abstract
Accurate and efficient Video Quality Assessment (VQA) has long been a key research challenge. Current mainstream VQA methods typically improve performance by pretraining on large-scale classification datasets (e.g., ImageNet, Kinetics-400), followed by fine-tuning on VQA datasets. However, this strategy presents two significant challenges: (1) merely transferring semantic knowledge learned from pretraining is insufficient for VQA, as video quality depends on multiple factors (e.g., semantics, distortion, motion, aesthetics); (2) pretraining on large-scale datasets demands enormous computational resources, often dozens or even hundreds of times greater than training directly on VQA datasets. Recently, Vision-Language Models (VLMs) have shown remarkable generalization capabilities across a wide range of visual tasks, and have begun to demonstrate promising potential in quality assessment. In this work, we propose Q-CLIP, the first fully VLM-based framework for VQA. Q-CLIP enhances both visual and textual representations through a Shared Cross-Modal Adapter (SCMA), which contains only a minimal number of trainable parameters and is the only component that requires training. This design significantly reduces computational cost. In addition, we introduce a set of five learnable quality-level prompts to guide the VLM in perceiving subtle quality variations, thereby further enhancing the model's sensitivity to video quality. Furthermore, we investigate the impact of different frame sampling strategies on VQA performance, and find that frame-difference-based sampling leads to better generalization across datasets. Extensive experiments demonstrate that Q-CLIP exhibits excellent performance on several VQA datasets.
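To make the abstract's description concrete, here is a minimal, hypothetical PyTorch sketch of the two trainable pieces it names: a shared bottleneck adapter applied to both frozen visual and textual features, and five learnable quality-level prompts whose similarities to the video feature are turned into an expected quality score. Module names, dimensions, and the scoring rule are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch, assuming a frozen CLIP-style backbone provides 512-d features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedCrossModalAdapter(nn.Module):
    """Residual bottleneck adapter shared by the visual and textual branches (assumed design)."""
    def __init__(self, dim: int = 512, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Only these few parameters would be trained; the backbone stays frozen.
        return x + self.up(F.relu(self.down(x)))

class QualityHead(nn.Module):
    """Scores a video against five learnable quality-level prompts."""
    def __init__(self, dim: int = 512, num_levels: int = 5):
        super().__init__()
        self.adapter = SharedCrossModalAdapter(dim)
        # Learnable embeddings standing in for prompts like "bad" ... "excellent".
        self.level_prompts = nn.Parameter(0.02 * torch.randn(num_levels, dim))
        # Map the five levels onto a 1-5 quality scale.
        self.register_buffer("level_values", torch.linspace(1.0, 5.0, num_levels))

    def forward(self, frame_features: torch.Tensor) -> torch.Tensor:
        # frame_features: (num_frames, dim) from a frozen vision encoder.
        video_feat = self.adapter(frame_features).mean(dim=0)   # pooled video feature
        text_feat = self.adapter(self.level_prompts)            # same adapter on the text side
        sims = F.cosine_similarity(video_feat.unsqueeze(0), text_feat, dim=-1)
        probs = sims.softmax(dim=-1)
        return (probs * self.level_values).sum()                # expected quality score

# Toy usage with random features standing in for frozen CLIP embeddings.
head = QualityHead()
print(float(head(torch.randn(8, 512))))
```

The expectation-over-levels scoring shown here is a common way to turn level prompts into a continuous score; whether Q-CLIP uses exactly this rule is not stated in the abstract.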
Problem

Research questions and friction points this paper is trying to address.

Improving video quality assessment using vision-language models
Reducing computational cost in VQA model training
Enhancing sensitivity to subtle video quality variations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Shared Cross-Modal Adapter enhances representations
Learnable quality prompts improve sensitivity
Frame-difference sampling boosts generalization (see the sketch after this list)
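The abstract only states that frame-difference-based sampling generalizes better and gives no procedure, so the selection rule below (keep the first frame plus the frames that follow the largest inter-frame changes) is purely an illustrative assumption, not the authors' method.

```python
# Hedged sketch of a frame-difference-based sampling strategy.
import numpy as np

def frame_difference_sampling(frames: np.ndarray, num_samples: int = 8) -> np.ndarray:
    """frames: (T, H, W, C) uint8 clip; returns sorted indices of the selected frames."""
    diffs = np.abs(frames[1:].astype(np.float32) - frames[:-1].astype(np.float32))
    change = diffs.mean(axis=(1, 2, 3))                 # mean absolute change per transition
    # Keep the first frame plus the frames following the largest changes (assumption).
    top = 1 + np.argsort(change)[::-1][: num_samples - 1]
    return np.sort(np.concatenate(([0], top)))

# Toy usage on a random clip.
clip = (np.random.rand(30, 64, 64, 3) * 255).astype(np.uint8)
print(frame_difference_sampling(clip, num_samples=8))
```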
🔎 Similar Papers
No similar papers found.
Yachun Mi
Harbin Institute of Technology
video quality assessment, video value analysis, multimodal fusion
Yu Li
Harbin Institute of Technology
Yanting Li
Harbin Institute of Technology
Shixin Sun
Harbin Institute of Technology
Chen Hui
Harbin Institute of Technology & Nanyang Technological University
image compression, quality assessment, multimedia security, image and video processing
Tong Zhang
Harbin Institute of Technology
Yuanyuan Liu
China University of Geosciences (Wuhan)
Chenyue Song
Harbin Institute of Technology
Shaohui Liu
Harbin Institute of Technology