🤖 AI Summary
Existing CLIP-based no-reference image quality assessment (IQA) methods rely solely on cosine similarity between image and text embeddings, overlooking the strong empirical correlation between CLIP image feature magnitudes and perceptual quality. Method: This work is the first to empirically reveal and validate this magnitude–quality relationship. We propose a fine-tuning-free, adaptive dual-cue fusion framework: (i) a semantic-normalized magnitude cue, modeled via a Box-Cox transformation and absolute-value aggregation of CLIP visual features; and (ii) a confidence-guided dynamic fusion mechanism that jointly leverages the magnitude and cosine similarity cues. Contribution/Results: Evaluated on LIVE, KonIQ-10k, and other benchmark IQA datasets, our method significantly outperforms standard CLIP and state-of-the-art training-free IQA models, demonstrating strong generalization across distortion types and quality ranges. This work establishes a new paradigm for exploiting deeper CLIP feature semantics, beyond alignment, in perceptual quality assessment.
📝 Abstract
Recent efforts have repurposed the Contrastive Language-Image Pre-training (CLIP) model for No-Reference Image Quality Assessment (NR-IQA) by measuring the cosine similarity between the image embedding and textual prompts such as "a good photo" or "a bad photo." However, this semantic similarity overlooks a critical yet underexplored cue: the magnitude of the CLIP image features, which we empirically find to exhibit a strong correlation with perceptual quality. In this work, we introduce a novel adaptive fusion framework that complements cosine similarity with a magnitude-aware quality cue. Specifically, we first extract the absolute CLIP image features and apply a Box-Cox transformation to statistically normalize the feature distribution and mitigate semantic sensitivity. The resulting scalar summary serves as a semantically normalized auxiliary cue that complements cosine-based prompt matching. To integrate both cues effectively, we further design a confidence-guided fusion scheme that adaptively weighs each term according to its relative strength. Extensive experiments on multiple benchmark IQA datasets demonstrate that our method consistently outperforms standard CLIP-based IQA and state-of-the-art baselines, without any task-specific training.
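The abstract's pipeline (absolute-value aggregation of the CLIP image embedding, Box-Cox normalization into a scalar magnitude cue, then confidence-guided fusion with the cosine-similarity cue) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `eps` offset, and the softmax-based weighting used to model "relative strength" are all assumptions, and the paper's exact fusion rule may differ.

```python
import numpy as np
from scipy.stats import boxcox

def magnitude_cue(image_feat, eps=1e-6):
    """Hypothetical semantic-normalized magnitude cue.

    Takes absolute values of a CLIP image embedding, applies a
    Box-Cox transform to normalize the distribution, and reduces
    the result to a scalar summary.
    """
    mags = np.abs(image_feat) + eps   # Box-Cox requires strictly positive input
    normalized, _ = boxcox(mags)      # lambda chosen by maximum likelihood
    return float(normalized.mean())   # scalar summary of feature magnitude

def confidence_guided_fusion(cos_cue, mag_cue, temperature=1.0):
    """One plausible instantiation of confidence-guided fusion:
    each cue is weighted by a softmax over the cue values, so the
    stronger cue contributes more to the final quality score."""
    cues = np.array([cos_cue, mag_cue])
    weights = np.exp(cues / temperature)
    weights /= weights.sum()
    return float(weights @ cues)

# Toy usage with a random stand-in for a 512-d CLIP embedding
rng = np.random.default_rng(0)
feat = rng.normal(size=512)
score = confidence_guided_fusion(cos_cue=0.31, mag_cue=magnitude_cue(feat))
```

In this sketch the fusion weights are recomputed per image, so the framework stays training-free: no parameters are fit to IQA labels, matching the abstract's "without any task-specific training" claim.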