🤖 AI Summary
Video Quality Assessment (VQA) faces two key challenges: conventional models lack effective video-level contextual understanding, while existing large language model (LLM)-based approaches are insensitive to subtle pixel-level distortions or treat quality scoring and natural language description as disjoint tasks. To address these issues, the authors propose CP-LLM, a dual-vision-encoder multimodal framework that separately encodes high-level semantic context and low-level pixel distortions, integrating them via a unified language decoder for joint reasoning. The model is trained end to end with multi-task objectives, simultaneously optimizing score regression, descriptive text generation, and pairwise quality comparison. On standard VQA benchmarks, CP-LLM achieves state-of-the-art cross-dataset performance, with markedly improved sensitivity to compression artifacts and other fine-grained distortions, as well as stronger cross-domain generalization.
📝 Abstract
Video quality assessment (VQA) is a challenging research topic with broad applications. Effective VQA necessitates sensitivity to pixel-level distortions and a comprehensive understanding of video context to accurately determine the perceptual impact of distortions. Traditional hand-crafted and learning-based VQA models mainly focus on pixel-level distortions and lack contextual understanding, while recent LLM-based models struggle with sensitivity to small distortions or handle quality scoring and description as separate tasks. To address these shortcomings, we introduce CP-LLM: a Context and Pixel aware Large Language Model. CP-LLM is a novel multimodal LLM architecture featuring dual vision encoders designed to independently analyze perceptual quality at both high-level (video context) and low-level (pixel distortion) granularity, along with a language decoder that subsequently reasons about the interplay between these aspects. This design enables CP-LLM to simultaneously produce robust quality scores and interpretable quality descriptions, with enhanced sensitivity to pixel distortions (e.g., compression artifacts). The model is trained via a multi-task pipeline optimizing for score prediction, description generation, and pairwise comparisons. Experimental results demonstrate that CP-LLM achieves state-of-the-art cross-dataset performance on established VQA benchmarks and superior robustness to pixel distortions, confirming its efficacy for comprehensive and practical video quality assessment in real-world scenarios.
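The dual-encoder design and multi-task objective described above can be illustrated with a minimal numpy sketch. This is not the paper's implementation: the real CP-LLM uses learned vision encoders and an LLM decoder, whereas here random projections stand in for the two encoders, a linear head stands in for the decoder's score output, and all names (`encode`, `quality_score`, `pairwise_loss`, the mean-opinion score `mos_a`) are illustrative assumptions. The sketch shows how one video is encoded twice (context stream and pixel stream), the streams are fused, and score-regression and pairwise-comparison losses are combined; the description-generation term (a token-level cross-entropy) is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(frames, proj):
    """Toy 'encoder': linearly project per-frame features (T, d_in) -> (T, d_out)."""
    return frames @ proj

def quality_score(context_feat, pixel_feat, w):
    """Fuse both streams by mean-pooling over time and regress a scalar score."""
    fused = np.concatenate([context_feat.mean(axis=0), pixel_feat.mean(axis=0)])
    return float(fused @ w)

def pairwise_loss(score_a, score_b):
    """Logistic (Bradley-Terry-style) loss, assuming video A is labeled better."""
    return float(np.log1p(np.exp(-(score_a - score_b))))

# Toy data: two videos, 8 frames of 16-dim frame features each.
video_a = rng.normal(size=(8, 16))
video_b = rng.normal(size=(8, 16))

# Two independent projections stand in for the context / pixel encoders.
ctx_proj = rng.normal(size=(16, 4))
pix_proj = rng.normal(size=(16, 4))
w = rng.normal(size=8)  # score head over the fused 8-dim feature

s_a = quality_score(encode(video_a, ctx_proj), encode(video_a, pix_proj), w)
s_b = quality_score(encode(video_b, ctx_proj), encode(video_b, pix_proj), w)

# Multi-task objective: MSE to a (hypothetical) ground-truth mean opinion
# score plus the pairwise comparison term.
mos_a = 3.5
total_loss = (s_a - mos_a) ** 2 + pairwise_loss(s_a, s_b)
print(f"total loss: {total_loss:.3f}")
```

In the full model, each task would contribute a weighted term to one loss that is backpropagated through both encoders and the decoder jointly, which is what lets pixel-level sensitivity and contextual reasoning inform each other during training.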