Revisiting Vision Language Foundations for No-Reference Image Quality Assessment

📅 2025-09-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper systematically evaluates the transferability of six mainstream vision backbones—CLIP, SigLIP2, DINOv2, DINOv3, Perception, and ResNet—for no-reference image quality assessment (NR-IQA). To enhance generalization, we propose a channel-wise learnable activation mechanism that adapts the nonlinear transformation per channel, replacing hand-crafted activation designs. The method attaches a lightweight MLP head to each pretrained backbone and fine-tunes end-to-end. Extensive experiments on three major NR-IQA benchmarks—CLIVE, KADID10K, and AGIQA3K—show state-of-the-art performance across all datasets, with SigLIP2 delivering the best results. These findings underscore the synergy between vision-language models and adaptive activation functions for NR-IQA, establishing an efficient and robust new baseline.
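The paper does not spell out the parameterization of the channel-wise learnable activation, but one plausible reading is a softmax-weighted mixture over a fixed bank of candidate activations (ReLU, GELU, sigmoid), with one learnable weight vector per channel. A minimal numpy sketch under that assumption (class and function names are illustrative, not from the paper):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class ChannelwiseActivation:
    """Soft selection over a fixed activation bank, one weight vector per channel.

    Hypothetical sketch: the actual mechanism in the paper may differ.
    """
    def __init__(self, num_channels, rng=None):
        rng = rng or np.random.default_rng(0)
        self.bank = [relu, gelu, sigmoid]
        # Learnable logits: one row per channel, one column per candidate activation.
        # In a real model these would be trained jointly with the backbone.
        self.logits = 1e-2 * rng.standard_normal((num_channels, len(self.bank)))

    def __call__(self, x):
        # x: (batch, channels). Mix the candidate activations per channel.
        w = softmax(self.logits)                        # (C, K) mixture weights
        acts = np.stack([f(x) for f in self.bank], -1)  # (B, C, K)
        return (acts * w).sum(axis=-1)                  # (B, C)
```

Because the mixture weights are differentiable, gradient descent can push each channel toward whichever nonlinearity helps most, which matches the paper's stated goal of eliminating manual activation design.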

📝 Abstract
Large-scale vision-language pre-training has recently shown promise for no-reference image quality assessment (NR-IQA), yet the relative merits of modern Vision Transformer foundations remain poorly understood. In this work, we present the first systematic evaluation of six prominent pretrained backbones—CLIP, SigLIP2, DINOv2, DINOv3, Perception, and ResNet—for NR-IQA, each fine-tuned with an identical lightweight MLP head. Our study uncovers two previously overlooked factors: (1) SigLIP2 consistently achieves strong performance; and (2) the choice of activation function plays a surprisingly crucial role, particularly for the generalization ability of image quality assessment models. Notably, we find that simple sigmoid activations outperform the commonly used ReLU and GELU on several benchmarks. Motivated by this finding, we introduce a learnable activation selection mechanism that adaptively determines the nonlinearity for each channel, eliminating the need for manual activation design and achieving new state-of-the-art SRCC on CLIVE, KADID10K, and AGIQA3K. Extensive ablations confirm the benefits across architectures and training regimes, establishing strong, resource-efficient NR-IQA baselines.
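For reference, the SRCC reported above is Spearman's rank correlation between predicted quality scores and ground-truth mean opinion scores: the Pearson correlation of the two rank vectors. A minimal implementation (assuming no tied scores; ties would require average ranks, as in `scipy.stats.spearmanr`):

```python
import numpy as np

def srcc(pred, mos):
    """Spearman rank correlation between predictions and mean opinion scores."""
    def ranks(a):
        order = np.argsort(a)
        r = np.empty(len(a), dtype=float)
        r[order] = np.arange(len(a))  # simple ranks; no tie handling
        return r
    rp, rm = ranks(np.asarray(pred)), ranks(np.asarray(mos))
    rp -= rp.mean()
    rm -= rm.mean()
    return float((rp * rm).sum() / np.sqrt((rp**2).sum() * (rm**2).sum()))
```

SRCC is the standard headline metric in NR-IQA because it rewards correct quality ordering regardless of the scale of the predicted scores.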
Problem

Research questions and friction points this paper is trying to address.

Evaluating vision transformer backbones for no-reference image quality assessment
Investigating activation function impact on IQA model generalization ability
Developing adaptive activation selection to eliminate manual design requirements
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematically evaluates six pretrained vision backbones
Introduces learnable activation selection mechanism
Achieves state-of-the-art performance on multiple benchmarks