IQA-Spider: Unifying Multi-Granularity Image Quality Assessment with Reasoning, Grounding and Referring

📅 2026-05-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limited perceptual scope of existing large multimodal models (LMMs) in image quality assessment, which lack a unified framework for multi-granularity understanding. The authors propose a unified four-task paradigm, which they describe as the first of its kind, covering global quality description, local quality description, pixel-level grounding of defects, and region-level referring. To support this paradigm, they build a dedicated dataset and adopt a two-stage design: the first stage performs text-level multi-task fine-tuning, while the second stage adds a training-free text-to-point grounding mechanism that bridges semantic understanding and pixel-level perception. Experiments across multiple benchmarks show strong performance, supporting the effectiveness and versatility of the approach for interpretable, multi-granularity image quality assessment.

📝 Abstract

We present IQA-Spider, the first image quality assessment (IQA) framework that unifies reasoning, grounding, and referring in a single large multimodal model (LMM) for multi-granularity quality understanding. Existing LMM-based IQA methods typically support only some of these perception dimensions, such as quality description and question answering (i.e., reasoning) or pixel-level grounding. This limitation largely stems from the absence of (i) a unified task and data formulation and (ii) effective optimization paradigms for multi-granularity learning. To address these limitations, we formulate a rigorous four-task paradigm covering global and local quality description, pixel-level grounding, and region-level referring. Based on this formulation, we construct a corresponding IQA dataset with a scalable, automatic annotation pipeline, providing a solid foundation for unified multi-granularity learning. To further enable unified perception, we adopt a conflict-free two-stage design that progressively extends text-level multi-granularity understanding to pixel-level grounding: (i) the first stage equips the model with fine-grained text-level reasoning across multiple IQA tasks, and (ii) the second stage introduces a training-free text-to-point grounding paradigm, which bridges textual semantics and pixel-level perception by mapping token logits to spatial coordinates. Together, these components yield IQA-Spider, a unified framework for multi-granularity, explainable image quality assessment. Extensive experiments across multiple benchmarks demonstrate strong performance, validating the effectiveness and versatility of the proposed formulation and framework.
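The abstract does not spell out how token logits are mapped to spatial coordinates, so the following is only an illustrative sketch of one way such a training-free mapping could work, not the authors' method. It assumes, hypothetically, that the model exposes one logit per equal-width coordinate bin along each image axis; the function names `softmax`, `logits_to_coordinate`, and `text_to_point` are invented for this example. Each axis's logits are turned into probabilities, and the point is taken as the probability-weighted average of the bin centers.

```python
import math
from typing import Sequence, Tuple


def softmax(logits: Sequence[float]) -> list:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_to_coordinate(logits: Sequence[float], extent: float) -> float:
    """Expected pixel position along one axis.

    Assumption (not from the paper): the axis is split into
    len(logits) equal-width bins, one logit per bin.
    """
    probs = softmax(logits)
    bin_width = extent / len(probs)
    # Weight each bin center by its probability.
    return sum(p * (i + 0.5) * bin_width for i, p in enumerate(probs))


def text_to_point(
    x_logits: Sequence[float],
    y_logits: Sequence[float],
    width: float,
    height: float,
) -> Tuple[float, float]:
    """Map hypothetical per-axis coordinate logits to an (x, y) pixel point."""
    return (
        logits_to_coordinate(x_logits, width),
        logits_to_coordinate(y_logits, height),
    )
```

Under these assumptions no parameters are learned: the point is read directly from the logits the model already produces, which is the sense in which a mapping of this kind would be training-free. Taking an expectation rather than the single most likely bin gives sub-bin resolution, at the cost of averaging between modes when the distribution is multi-peaked.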

Problem

Research questions and friction points this paper is trying to address.

image quality assessment

multi-granularity

reasoning

grounding

referring

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-granularity image quality assessment

large multimodal model

text-to-point grounding