🤖 AI Summary
Existing no-reference image quality assessment (NR-IQA) methods rely heavily on human opinion scores (e.g., MOS), which limits their scalability and practical deployment. To address this, we propose QualiCLIP, a CLIP-based self-supervised, opinion-unaware framework that learns quality representations without any human annotations. Our approach leverages CLIP’s vision-language alignment through three key components: (1) quality-related antonym text prompts (e.g., “sharp” vs. “blurry”) that guide image–text alignment; (2) synthetically generated multi-level degradation sequences that provide ordinal quality signals for ranking; and (3) a consistency constraint that enforces similar representations for images with similar content and the same degradation level. On datasets with authentic distortions, QualiCLIP significantly outperforms other opinion-unaware NR-IQA methods and, in cross-dataset experiments, matches or surpasses state-of-the-art supervised approaches despite not requiring MOS. The code and pretrained model are publicly available.
📝 Abstract
No-Reference Image Quality Assessment (NR-IQA) focuses on designing methods to measure image quality in alignment with human perception when a high-quality reference image is unavailable. The reliance on human-annotated Mean Opinion Score (MOS) in the majority of state-of-the-art NR-IQA approaches limits their scalability and broader applicability to real-world scenarios. To overcome this limitation, we propose QualiCLIP (Quality-aware CLIP), a CLIP-based self-supervised opinion-unaware method that does not require MOS. In particular, we introduce a quality-aware image-text alignment strategy to make CLIP generate quality-aware image representations. Starting from pristine images, we synthetically degrade them with increasing levels of intensity. Then, we train CLIP to rank these degraded images based on their similarity to quality-related antonym text prompts. At the same time, we force CLIP to generate consistent representations for images with similar content and the same level of degradation. Our method significantly outperforms other opinion-unaware approaches on several datasets with authentic distortions. Moreover, despite not requiring MOS, QualiCLIP achieves state-of-the-art performance even when compared with supervised methods in cross-dataset experiments, thus proving to be suitable for application in real-world scenarios. The code and the model are publicly available at https://github.com/miccunifi/QualiCLIP.
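The two training signals described above can be sketched in a few lines: a quality score obtained from the softmax similarity between an image embedding and a pair of antonym prompt embeddings, and a ranking penalty that fires whenever a more degraded version of an image scores higher than a less degraded one. The sketch below uses NumPy with random stand-in embeddings instead of a real CLIP model; the margin value, function names, and 512-dimensional embeddings are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of the quality-aware ranking idea from the abstract.
# Assumption: embeddings come from some encoder (here: random vectors,
# standing in for CLIP image/text features); margin=0.05 is arbitrary.
import numpy as np

rng = np.random.default_rng(0)


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def quality_score(img_emb, pos_emb, neg_emb):
    """Softmax probability of the positive (e.g., "Good photo") prompt
    over the antonym pair -- higher means the image looks higher quality."""
    s_pos, s_neg = cosine(img_emb, pos_emb), cosine(img_emb, neg_emb)
    e = np.exp([s_pos, s_neg])
    return float(e[0] / e.sum())


def ranking_loss(scores, margin=0.05):
    """Hinge penalty enforcing that quality scores decrease with the
    degradation level: scores[i] should exceed scores[i+1] by `margin`."""
    return sum(max(0.0, margin - (scores[i] - scores[i + 1]))
               for i in range(len(scores) - 1))


# Stand-in data: one antonym prompt pair and four increasingly
# degraded versions of the same image.
pos_emb, neg_emb = rng.normal(size=512), rng.normal(size=512)
degraded = [rng.normal(size=512) for _ in range(4)]
scores = [quality_score(e, pos_emb, neg_emb) for e in degraded]
print(ranking_loss(scores))
```

In the full method this loss would be backpropagated through CLIP's encoders, alongside the consistency term that pulls together representations of same-content, same-degradation-level images; neither is shown here.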