Contrastive learning-based video quality assessment-jointed video vision transformer for video recognition

📅 2026-03-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two linked problems: blur significantly degrades video classification performance, and no-reference video quality assessment (VQA) is hard to train because video datasets rarely include ground-truth quality labels. The authors propose SSL-V3, which integrates no-reference VQA into the video classification pipeline through a joint self-supervised learning framework. The estimated quality score modulates the classification features, and the supervised classification task in turn optimizes the VQA parameters, so SSL-V3 performs quality-aware recognition without VQA ground-truth annotations. Built on a contrastive learning-based Video Vision Transformer, the method reports results on two datasets, including a classification accuracy of 94.87% on some interview videos in the I-CONECT dataset.

📝 Abstract

Video quality significantly affects video classification. We encountered this problem when we classified Mild Cognitive Impairment well from clear videos but worse from blurred ones. We then realized that incorporating Video Quality Assessment (VQA) may improve video classification. This paper proposes a Self-Supervised Learning-based Video Vision Transformer combined with No-reference VQA for video classification (SSL-V3) to achieve this goal. SSL-V3 leverages a Combined-SSL mechanism to integrate VQA into video classification and to address the shortage of VQA labels, which is common in video datasets and makes it impossible to provide an accurate Video Quality Score. In brief, Combined-SSL uses the video quality score as a factor that directly tunes the feature map of the video classification branch. The score then serves as the point of intersection linking VQA and classification, so that the supervised classification task tunes the parameters of VQA. SSL-V3 achieved robust experimental results on two datasets. For example, it reached an accuracy of 94.87% on some interview videos in I-CONECT (a healthcare dataset involving facial videos), verifying SSL-V3's effectiveness.
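
The coupling described in the abstract can be illustrated with a toy sketch. It is not the SSL-V3 architecture: it assumes a single scalar quality score produced by a sigmoid, assumes that "tuning the feature map" means element-wise scaling, and uses a binary logistic classifier in place of the Video Vision Transformer. All names (`forward`, `train_step`, `w_vqa`, `w_cls`) are hypothetical. The point it shows is that the VQA weights receive a gradient from the classification loss alone, with no quality label involved.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def forward(features, quality_feats, w_cls, w_vqa):
    """Return (quality score, class probability) for one video."""
    q = sigmoid(dot(w_vqa, quality_feats))   # assumed VQA head: score in (0, 1)
    modulated = [q * f for f in features]    # assumed modulation: scale the features
    p = sigmoid(dot(w_cls, modulated))       # toy binary classifier
    return q, p


def train_step(features, quality_feats, label, w_cls, w_vqa, lr=0.1):
    """One gradient step on the classification loss only (no quality label)."""
    q, p = forward(features, quality_feats, w_cls, w_vqa)
    loss = -(label * math.log(p) + (1 - label) * math.log(1 - p))

    d_logit = p - label                      # d(loss)/d(logit) for BCE + sigmoid
    s = dot(w_cls, features)                 # logit = q * s
    grad_cls = [d_logit * q * f for f in features]
    d_q = d_logit * s                        # gradient reaching the quality score
    d_z = d_q * q * (1.0 - q)                # back through the VQA sigmoid
    grad_vqa = [d_z * x for x in quality_feats]

    new_w_cls = [w - lr * g for w, g in zip(w_cls, grad_cls)]
    new_w_vqa = [w - lr * g for w, g in zip(w_vqa, grad_vqa)]
    return new_w_cls, new_w_vqa, loss


if __name__ == "__main__":
    # Made-up numbers for illustration only.
    w_cls, w_vqa = [0.2, 0.1], [0.0, 0.0]
    feats, qfeats = [1.0, -0.5], [0.3, 0.8]
    for step in range(50):
        w_cls, w_vqa, loss = train_step(feats, qfeats, 1, w_cls, w_vqa)
    print("final loss:", loss)
    print("VQA weights moved to:", w_vqa)
```

In this sketch the quality score is the only path between the two sets of weights, which mirrors the abstract's description of the score as the point linking VQA and classification. How SSL-V3 actually computes and applies the score is not specified in this summary.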

Problem

Research questions and friction points this paper is trying to address.

Video Quality Assessment

Video Classification

Label Shortage

No-reference VQA

Video Recognition

Innovation

Methods, ideas, or system contributions that make the work stand out.

Contrastive learning

Video Vision Transformer

No-reference VQA