Foundation Models Boost Low-Level Perceptual Similarity Metrics

📅 2024-09-11
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
To address the misalignment between model predictions and human visual perception in full-reference image quality assessment (FR-IQA), this paper proposes a zero-shot feature-distance metric that requires no fine-tuning. Instead of relying solely on final-layer outputs or embedding vectors, the method systematically exploits intermediate-layer features from pre-trained vision foundation models (e.g., ViT- or CNN-based backbones). Quality scores are computed directly from parameter-free measures (Euclidean distance or cosine similarity) between corresponding intermediate feature maps of the reference and distorted images. The work demonstrates empirically that intermediate-layer features, largely unexplored for this purpose, strongly encode low-level perceptual similarity, challenging the conventional FR-IQA paradigm that depends on either end-to-end learning or high-level semantic features. Evaluated on multiple standard benchmarks, the proposed method achieves state-of-the-art performance without any training, outperforming classical metrics (PSNR, SSIM) and recent learning-based approaches (e.g., DISTS, PieAPP).
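
The scoring step in the summary above can be illustrated with a minimal sketch. It assumes the intermediate features have already been extracted from a pre-trained backbone and are given as one list of feature vectors (one per spatial position or token) per chosen layer. The function names, the input layout, and the plain averaging over positions and then over layers are illustrative assumptions, not the paper's exact recipe.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: two zero vectors are identical.
        return 0.0 if norm_u == norm_v else 1.0
    return 1.0 - dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Return the L2 distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def feature_distance(ref_feats, dist_feats, metric=cosine_distance):
    """Average a parameter-free distance over positions, then over layers.

    ref_feats and dist_feats hold one entry per chosen intermediate layer;
    each entry is a list of feature vectors. Lower scores mean the distorted
    image is closer to the reference in feature space.
    """
    if len(ref_feats) != len(dist_feats):
        raise ValueError("both images need features from the same layers")
    layer_scores = []
    for ref_layer, dist_layer in zip(ref_feats, dist_feats):
        if len(ref_layer) != len(dist_layer):
            raise ValueError("feature maps must have the same number of positions")
        per_position = [metric(r, d) for r, d in zip(ref_layer, dist_layer)]
        layer_scores.append(sum(per_position) / len(per_position))
    return sum(layer_scores) / len(layer_scores)


# Made-up toy features: one layer, two positions, two channels.
toy_ref = [[[1.0, 0.0], [0.0, 1.0]]]
toy_dist = [[[1.0, 0.0], [1.0, 0.0]]]
print(feature_distance(toy_ref, toy_dist))
print(feature_distance(toy_ref, toy_dist, metric=euclidean_distance))
```

No parameters are learned anywhere in this computation, which is what makes the score usable without training. Which layers to include, and whether to normalize features first, are choices this summary does not specify.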

📝 Abstract
For full-reference image quality assessment (FR-IQA) using deep-learning approaches, the perceptual similarity score between a distorted image and a reference image is typically computed as a distance measure between features extracted from a pretrained CNN or, more recently, a Transformer network. Often, these intermediate features require further fine-tuning or processing with additional neural network layers to align the final similarity scores with human judgments. So far, most IQA models based on foundation models have relied primarily on the final layer or the embedding for quality score estimation. In contrast, this work explores the potential of the intermediate features of these foundation models, which have so far been largely unexplored in the design of low-level perceptual similarity metrics. We demonstrate that the intermediate features are comparatively more effective. Moreover, without requiring any training, these metrics can outperform both traditional and state-of-the-art learned metrics by using distance measures between the features.

Problem

Research questions and friction points this paper is trying to address.

Image Quality Assessment
Human Perception Alignment
Intermediate Layer Representation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Middle Layer Features
Image Quality Assessment
Perceptual Similarity