🤖 AI Summary
Video Quality Assessment (VQA) faces two key challenges: conventional models lack effective video-level contextual understanding, while existing large language model (LLM)-based approaches are insensitive to subtle pixel-level distortions or treat quality scoring and natural language description as disjoint tasks. To address these issues, the authors propose CP-LLM, a dual-vision-encoder multimodal framework that separately encodes high-level semantic context and low-level pixel distortions, integrating them via a unified language decoder for joint reasoning. The model is trained end to end with multi-task objectives, simultaneously optimizing score regression, descriptive text generation, and pairwise quality comparison. On standard VQA benchmarks, CP-LLM achieves state-of-the-art cross-dataset performance, with markedly improved sensitivity to compression artifacts and other fine-grained distortions, as well as stronger cross-domain generalization.
📝 Abstract
Video quality assessment (VQA) is a challenging research topic with broad applications. Effective VQA necessitates sensitivity to pixel-level distortions and a comprehensive understanding of video context to accurately determine the perceptual impact of distortions. Traditional hand-crafted and learning-based VQA models mainly focus on pixel-level distortions and lack contextual understanding, while recent LLM-based models struggle with sensitivity to small distortions or handle quality scoring and description as separate tasks. To address these shortcomings, we introduce CP-LLM: a Context and Pixel aware Large Language Model. CP-LLM is a novel multimodal LLM architecture featuring dual vision encoders designed to independently analyze perceptual quality at both high-level (video context) and low-level (pixel distortion) granularity, along with a language decoder that subsequently reasons about the interplay between these aspects. This design enables CP-LLM to simultaneously produce robust quality scores and interpretable quality descriptions, with enhanced sensitivity to pixel distortions (e.g., compression artifacts). The model is trained via a multi-task pipeline optimizing for score prediction, description generation, and pairwise comparisons. Experimental results demonstrate that CP-LLM achieves state-of-the-art cross-dataset performance on established VQA benchmarks and superior robustness to pixel distortions, confirming its efficacy for comprehensive and practical video quality assessment in real-world scenarios.
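The dual-encoder design and multi-task objective described above can be illustrated with a minimal numpy sketch. This is not the paper's implementation: the real CP-LLM uses learned vision encoders and an LLM decoder, whereas here random projections stand in for the two encoders, a linear head stands in for the decoder's score output, and all names (`encode`, `quality_score`, `pairwise_loss`, the mean-opinion score `mos_a`) are illustrative assumptions. The sketch shows how one video is encoded twice (context stream and pixel stream), the streams are fused, and score-regression and pairwise-comparison losses are combined; the description-generation term (a token-level cross-entropy) is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(frames, proj):
    """Toy 'encoder': linearly project per-frame features (T, d_in) -> (T, d_out)."""
    return frames @ proj

def quality_score(context_feat, pixel_feat, w):
    """Fuse both streams by mean-pooling over time and regress a scalar score."""
    fused = np.concatenate([context_feat.mean(axis=0), pixel_feat.mean(axis=0)])
    return float(fused @ w)

def pairwise_loss(score_a, score_b):
    """Logistic (Bradley-Terry-style) loss, assuming video A is labeled better."""
    return float(np.log1p(np.exp(-(score_a - score_b))))

# Toy data: two videos, 8 frames of 16-dim frame features each.
video_a = rng.normal(size=(8, 16))
video_b = rng.normal(size=(8, 16))

# Two independent projections stand in for the context / pixel encoders.
ctx_proj = rng.normal(size=(16, 4))
pix_proj = rng.normal(size=(16, 4))
w = rng.normal(size=8)  # score head over the fused 8-dim feature

s_a = quality_score(encode(video_a, ctx_proj), encode(video_a, pix_proj), w)
s_b = quality_score(encode(video_b, ctx_proj), encode(video_b, pix_proj), w)

# Multi-task objective: MSE to a (hypothetical) ground-truth mean opinion
# score plus the pairwise comparison term.
mos_a = 3.5
total_loss = (s_a - mos_a) ** 2 + pairwise_loss(s_a, s_b)
print(f"total loss: {total_loss:.3f}")
```

In the full model, each task would contribute a weighted term to one loss that is backpropagated through both encoders and the decoder jointly, which is what lets pixel-level sensitivity and contextual reasoning inform each other during training.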