Unleashing Vision Transformer Potential In Image Quality Assessment via Global-Local Adaptive Interaction

📅 2026-05-17

🤖 AI Summary

This work addresses no-reference image quality assessment (NR-IQA) of real-world distorted images, where complex distortions, costly subjective annotation, and limited data make it hard to achieve accuracy and scalability at the same time. It proposes the Global-Local Interaction Adapter (GLIA), which builds on a pretrained Vision Transformer. GLIA combines a dual-stream architecture with lightweight adapter-based fine-tuning, and uses an adaptive interaction mechanism to fuse global semantic features with local detail features. This design reduces the number of trainable parameters while improving the model's perceptual capability and its generalization across diverse real-world distortions. Experiments on multiple mainstream IQA benchmarks report state-of-the-art prediction accuracy and robustness.
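The parameter savings described above can be made concrete with simple arithmetic. The sketch below is not GLIA's actual configuration, which this summary does not specify. It assumes a ViT-B-sized encoder (hidden size 768, 12 blocks, MLP ratio 4) and one hypothetical bottleneck adapter per block (down-projection followed by up-projection, width 64), then compares the adapter parameters with those of the frozen backbone blocks. All function names are illustrative.

```python
# Back-of-the-envelope parameter count: frozen ViT blocks vs. small adapters.
# ASSUMPTIONS (not from the paper): ViT-B-like sizes, one bottleneck adapter
# per block, bottleneck width 64.

def linear_params(d_in, d_out, bias=True):
    """Parameters of a fully connected layer."""
    return d_in * d_out + (d_out if bias else 0)

def vit_block_params(d_model, mlp_ratio=4):
    """Parameters of one standard Transformer encoder block."""
    # Attention: fused QKV projection plus output projection.
    attn = linear_params(d_model, 3 * d_model) + linear_params(d_model, d_model)
    # MLP: expand then contract.
    hidden = mlp_ratio * d_model
    mlp = linear_params(d_model, hidden) + linear_params(hidden, d_model)
    # Two LayerNorms, each with a scale and a shift vector.
    norms = 2 * 2 * d_model
    return attn + mlp + norms

def bottleneck_adapter_params(d_model, bottleneck):
    """Parameters of a down-projection / up-projection adapter."""
    return linear_params(d_model, bottleneck) + linear_params(bottleneck, d_model)

D_MODEL, DEPTH, BOTTLENECK = 768, 12, 64  # hypothetical settings

backbone_total = DEPTH * vit_block_params(D_MODEL)
adapter_total = DEPTH * bottleneck_adapter_params(D_MODEL, BOTTLENECK)
trainable_fraction = adapter_total / backbone_total

print(f"frozen backbone blocks: {backbone_total:,} parameters")
print(f"trainable adapters:     {adapter_total:,} parameters")
print(f"trainable fraction:     {trainable_fraction:.2%}")
```

Under these assumed sizes the adapters amount to roughly 1.4% of the backbone block parameters, which illustrates why adapter-based fine-tuning is attractive when labeled IQA data is scarce. The real ratio for GLIA depends on design details not given here.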

📝 Abstract

In the field of Blind Image Quality Assessment (BIQA), accurately predicting the perceptual quality of authentically distorted images remains highly challenging due to the diverse and complex distortions present in natural environments. Although existing methods have achieved notable accuracy, their scalability is often constrained by the high cost of subjective annotation and the limited size of available datasets. Recent advances in large-scale pre-trained vision models have introduced powerful semantic and representational capabilities, yet their application to IQA tasks is hindered by substantial computational demands and suboptimal fine-tuning efficiency. To overcome these limitations, we introduce the Global-Local Interaction Adapter (GLIA), a novel framework that effectively harnesses pre-trained Vision Transformers through a dual-stream feature extraction mechanism coupled with interactive global-local fusion. By jointly retaining global semantic information and fine-grained local details, our approach delivers superior prediction accuracy and robustness while requiring significantly fewer trainable parameters. Extensive experiments on multiple benchmarks validate the effectiveness and superiority of our approach.
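One common way to combine a global stream with a local stream is a learned gate that decides, per channel, how much of each to keep. The sketch below illustrates that general idea only; the abstract does not describe GLIA's interaction mechanism in detail, so the gating form, the mean pooling of patch tokens, and every name here (`mean_pool`, `gated_fusion`, the gate weights) are assumptions for illustration.

```python
import math

def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def mean_pool(tokens):
    """Average a list of token vectors into one vector (local summary)."""
    n = len(tokens)
    return [sum(channel) / n for channel in zip(*tokens)]

def gated_fusion(global_feat, local_feat, gate_weights, gate_bias):
    """Fuse two feature vectors with a per-channel gate.

    For channel i:
        g_i     = sigmoid(w_global_i * global_i + w_local_i * local_i + b_i)
        fused_i = g_i * global_i + (1 - g_i) * local_i
    gate_weights is a list of (w_global_i, w_local_i) pairs. In a trained
    model these would be learned; here they are passed in by hand.
    """
    fused = []
    for g_val, l_val, (w_g, w_l), b in zip(
        global_feat, local_feat, gate_weights, gate_bias
    ):
        gate = sigmoid(w_g * g_val + w_l * l_val + b)
        fused.append(gate * g_val + (1.0 - gate) * l_val)
    return fused

# Toy example: a 2-channel global vector and two 2-channel patch tokens.
global_vec = [0.0, 2.0]
patch_tokens = [[1.0, 3.0], [3.0, 5.0]]
local_vec = mean_pool(patch_tokens)

# Zero weights and bias give a gate of exactly 0.5 on every channel,
# so the fused vector is the plain average of the two streams.
neutral = gated_fusion(global_vec, local_vec, [(0.0, 0.0)] * 2, [0.0, 0.0])
print(neutral)
```

Because each fused channel is a convex combination of the two inputs, the output always lies between the global and local values, and the gate parameters control which stream dominates.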

Problem

Research questions and friction points this paper is trying to address.

Blind Image Quality Assessment

authentically distorted images

pre-trained Vision Transformers

computational demands

fine-tuning efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Global-Local Interaction

Vision Transformer

Blind Image Quality Assessment