Beyond Cosine Similarity: Magnitude-Aware CLIP for No-Reference Image Quality Assessment

📅 2025-11-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing CLIP-based no-reference image quality assessment (IQA) methods rely solely on cosine similarity between image and text embeddings, overlooking the strong empirical correlation between CLIP image feature magnitudes and perceptual quality. Method: This work is the first to empirically reveal and validate this magnitude–quality relationship. We propose a fine-tuning-free, adaptive dual-cue fusion framework: (i) a semantic-normalized magnitude cue, modeled via Box-Cox transformation and absolute-value aggregation of CLIP visual features; and (ii) a confidence-guided dynamic fusion mechanism that jointly leverages magnitude and cosine similarity cues. Contribution/Results: Evaluated on LIVE, KonIQ-10k, and other benchmark IQA datasets, our method significantly outperforms standard CLIP and state-of-the-art training-free IQA models, demonstrating strong generalization across distortion types and quality ranges. This work establishes a new paradigm for exploiting deeper CLIP feature semantics—beyond alignment—within perceptual quality assessment.

📝 Abstract
Recent efforts have repurposed the Contrastive Language-Image Pre-training (CLIP) model for No-Reference Image Quality Assessment (NR-IQA) by measuring the cosine similarity between the image embedding and textual prompts such as "a good photo" or "a bad photo." However, this semantic similarity overlooks a critical yet underexplored cue: the magnitude of the CLIP image features, which we empirically find to exhibit a strong correlation with perceptual quality. In this work, we introduce a novel adaptive fusion framework that complements cosine similarity with a magnitude-aware quality cue. Specifically, we first extract the absolute CLIP image features and apply a Box-Cox transformation to statistically normalize the feature distribution and mitigate semantic sensitivity. The resulting scalar summary serves as a semantically normalized auxiliary cue that complements cosine-based prompt matching. To integrate both cues effectively, we further design a confidence-guided fusion scheme that adaptively weighs each term according to its relative strength. Extensive experiments on multiple benchmark IQA datasets demonstrate that our method consistently outperforms standard CLIP-based IQA and state-of-the-art baselines, without any task-specific training.
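The magnitude cue described above (absolute CLIP feature values, Box-Cox normalization, scalar aggregation) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the mean aggregation, the small positive offset `eps` (Box-Cox requires strictly positive inputs), and the use of the maximum-likelihood lambda are all assumptions, since the paper's exact formulation is not reproduced here.

```python
import numpy as np
from scipy import stats


def magnitude_cue(clip_feat: np.ndarray, eps: float = 1e-6) -> float:
    """Scalar magnitude cue from a CLIP image embedding (illustrative sketch).

    Takes absolute feature values, applies a Box-Cox transform to normalize
    the heavy-tailed magnitude distribution, then averages into one scalar.
    The mean aggregation is an assumption made for this sketch.
    """
    x = np.abs(clip_feat) + eps                # Box-Cox needs strictly positive data
    transformed, _lmbda = stats.boxcox(x)      # lambda fitted by maximum likelihood
    return float(transformed.mean())


# Toy example with a random 512-d stand-in for a CLIP image embedding.
rng = np.random.default_rng(0)
feat = rng.normal(size=512)
cue = magnitude_cue(feat)
print(cue)
```

In practice `clip_feat` would be the (pre-normalization) image-encoder output of a real CLIP model; the point of the sketch is only the absolute-value plus Box-Cox pipeline.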
Problem

Research questions and friction points this paper is trying to address.

Enhancing CLIP-based image quality assessment with magnitude-aware features
Integrating semantic similarity with statistically normalized feature magnitudes
Developing adaptive fusion of quality cues without task-specific training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Magnitude-aware quality cue complements cosine similarity
Box-Cox transformation normalizes CLIP feature distribution
Confidence-guided fusion adaptively weighs quality cues
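The two cues in the bullets above can be combined as in this hedged sketch. The prompt-matching cue follows the standard CLIP-IQA recipe (softmax over similarities to "good"/"bad" prompts); the confidence-to-weight mapping (distance of the cosine score from the uninformative point 0.5) and the temperature `tau` are assumptions for illustration, not the paper's exact fusion rule.

```python
import numpy as np


def cosine_cue(img_feat: np.ndarray, good_feat: np.ndarray,
               bad_feat: np.ndarray, tau: float = 0.01) -> float:
    """Softmax over cosine similarities to 'a good photo' vs. 'a bad photo'
    prompt embeddings; returns the probability mass on the 'good' prompt."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    s = np.array([cos(img_feat, good_feat), cos(img_feat, bad_feat)]) / tau
    p = np.exp(s - s.max())                    # numerically stable softmax
    p /= p.sum()
    return float(p[0])


def fuse(cos_score: float, mag_score: float) -> float:
    """Confidence-guided fusion sketch (assumed form): the cosine cue gets
    more weight the further it sits from the uninformative value 0.5."""
    conf = abs(cos_score - 0.5) * 2.0          # confidence in [0, 1]
    return conf * cos_score + (1.0 - conf) * mag_score


# Toy usage: an image embedding aligned with the "good" prompt.
good = np.array([1.0, 0.0])
bad = np.array([0.0, 1.0])
c = cosine_cue(good, good, bad)
print(fuse(c, 0.3))
```

When the cosine cue is uninformative (score 0.5), this scheme falls back entirely on the magnitude cue, which matches the adaptive-weighting idea described in the summary.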
Zhicheng Liao
School of Computer Science, South China Normal University, China
Dongxu Wu
School of Computer Science, South China Normal University, China
Zhenshan Shi
School of Computer Science, South China Normal University, China
Sijie Mai
School of Computer Science, South China Normal University, China
Hanwei Zhu
School of Computer Science and Engineering, Nanyang Technological University, Singapore
Lingyu Zhu
School of Computer Science, City University of Hong Kong, China
Yuncheng Jiang
West China Hospital, Sichuan University
Baoliang Chen
School of Computer Science, South China Normal University, China