🤖 AI Summary
Existing CLIP-based no-reference image quality assessment (IQA) methods rely solely on cosine similarity between image and text embeddings, overlooking the strong empirical correlation between CLIP image feature magnitudes and perceptual quality. Method: This work is the first to empirically reveal and validate this magnitude–quality relationship. We propose a fine-tuning-free, adaptive dual-cue fusion framework: (i) a semantic-normalized magnitude cue, modeled via a Box-Cox transformation and absolute-value aggregation of CLIP visual features; and (ii) a confidence-guided dynamic fusion mechanism that jointly leverages the magnitude and cosine similarity cues. Contribution/Results: Evaluated on LIVE, KonIQ-10k, and other benchmark IQA datasets, our method significantly outperforms standard CLIP and state-of-the-art training-free IQA models, demonstrating strong generalization across distortion types and quality ranges. This work establishes a new paradigm for exploiting deeper CLIP feature semantics, beyond alignment, in perceptual quality assessment.
📝 Abstract
Recent efforts have repurposed the Contrastive Language-Image Pre-training (CLIP) model for No-Reference Image Quality Assessment (NR-IQA) by measuring the cosine similarity between the image embedding and textual prompts such as "a good photo" or "a bad photo." However, this semantic similarity overlooks a critical yet underexplored cue: the magnitude of the CLIP image features, which we empirically find to exhibit a strong correlation with perceptual quality. In this work, we introduce a novel adaptive fusion framework that complements cosine similarity with a magnitude-aware quality cue. Specifically, we first extract the absolute CLIP image features and apply a Box-Cox transformation to statistically normalize the feature distribution and mitigate semantic sensitivity. The resulting scalar summary serves as a semantically normalized auxiliary cue that complements cosine-based prompt matching. To integrate both cues effectively, we further design a confidence-guided fusion scheme that adaptively weighs each term according to its relative strength. Extensive experiments on multiple benchmark IQA datasets demonstrate that our method consistently outperforms standard CLIP-based IQA and state-of-the-art baselines, without any task-specific training.
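The abstract's pipeline (absolute-value aggregation of the CLIP image embedding, Box-Cox normalization into a scalar magnitude cue, then confidence-guided fusion with the cosine-similarity cue) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `eps` offset, and the softmax-based weighting used to model "relative strength" are all assumptions, and the paper's exact fusion rule may differ.

```python
import numpy as np
from scipy.stats import boxcox

def magnitude_cue(image_feat, eps=1e-6):
    """Hypothetical semantic-normalized magnitude cue.

    Takes absolute values of a CLIP image embedding, applies a
    Box-Cox transform to normalize the distribution, and reduces
    the result to a scalar summary.
    """
    mags = np.abs(image_feat) + eps   # Box-Cox requires strictly positive input
    normalized, _ = boxcox(mags)      # lambda chosen by maximum likelihood
    return float(normalized.mean())   # scalar summary of feature magnitude

def confidence_guided_fusion(cos_cue, mag_cue, temperature=1.0):
    """One plausible instantiation of confidence-guided fusion:
    each cue is weighted by a softmax over the cue values, so the
    stronger cue contributes more to the final quality score."""
    cues = np.array([cos_cue, mag_cue])
    weights = np.exp(cues / temperature)
    weights /= weights.sum()
    return float(weights @ cues)

# Toy usage with a random stand-in for a 512-d CLIP embedding
rng = np.random.default_rng(0)
feat = rng.normal(size=512)
score = confidence_guided_fusion(cos_cue=0.31, mag_cue=magnitude_cue(feat))
```

In this sketch the fusion weights are recomputed per image, so the framework stays training-free: no parameters are fit to IQA labels, matching the abstract's "without any task-specific training" claim.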