🤖 AI Summary
This work addresses the limited perceptual scope of existing large multimodal models (LMMs) for image quality assessment (IQA), which typically cover only some perception dimensions and lack a unified framework for multi-granularity understanding. The authors propose a unified four-task paradigm, which they describe as the first of its kind, covering global and local quality description, pixel-level grounding, and region-level referring. To support it, they build a dedicated dataset with a scalable, automatic annotation pipeline and adopt a two-stage design: the first stage performs text-level multi-task fine-tuning, and the second adds a training-free text-to-point grounding mechanism that maps token logits to spatial coordinates, linking textual semantics to pixel-level perception. Experiments across multiple benchmarks show strong performance, supporting the effectiveness and versatility of the approach for explainable, multi-granularity IQA.
📝 Abstract
We present IQA-Spider, the first image quality assessment (IQA) framework that unifies reasoning, grounding, and referring in a single large multimodal model (LMM) for multi-granularity quality understanding. Existing LMM-based IQA methods typically support only some perception dimensions, such as quality description and question answering (i.e., reasoning) or pixel-level grounding. This limitation largely stems from the absence of (i) a unified task and data formulation and (ii) effective optimization paradigms for multi-granularity learning. To address these limitations, we formulate a rigorous four-task paradigm covering global and local quality description, pixel-level grounding, and region-level referring. Based on this formulation, we construct a corresponding IQA dataset with a scalable and automatic annotation pipeline, providing a solid foundation for unified multi-granularity learning. To further enable unified perception, we adopt a conflict-free two-stage design that progressively extends text-level multi-granularity understanding to pixel-level grounding: (i) the first stage equips the model with fine-grained text-level reasoning across multiple IQA tasks, and (ii) the second stage introduces a training-free text-to-point grounding paradigm, which bridges textual semantics and pixel-level perception by mapping token logits to spatial coordinates. Together, these components yield IQA-Spider, a unified framework for multi-granularity, explainable image quality assessment. Extensive experiments across multiple benchmarks demonstrate strong performance, validating the effectiveness and versatility of the proposed formulation and framework.
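The abstract does not say how token logits are mapped to spatial coordinates, so the sketch below is only one plausible reading and not the authors' mechanism. It assumes, hypothetically, that the model exposes one logit per coordinate-bin token along each axis, and it decodes a point as the softmax-weighted mean of the bin centres. The names `logits_to_point`, `x_logits`, and `y_logits` are illustrative and do not come from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_to_point(x_logits, y_logits, width, height):
    """Decode a pixel coordinate from per-axis bin logits.

    Assumption (not from the paper): each axis is split into equal-width
    bins, one candidate token per bin, and the point is the expected bin
    centre under the softmax distribution over those tokens.
    """
    px = softmax(x_logits)
    py = softmax(y_logits)
    # Bin centres in normalised [0, 1] coordinates, weighted by probability.
    x_norm = sum(p * (i + 0.5) / len(px) for i, p in enumerate(px))
    y_norm = sum(p * (j + 0.5) / len(py) for j, p in enumerate(py))
    return x_norm * width, y_norm * height


# Hypothetical logits for a 4-bin grid on a 100 x 200 image.
point = logits_to_point([0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 50.0], 100, 200)
```

No weights are updated in this decoding step, which matches the training-free property named in the abstract. Taking the expectation rather than the argmax gives sub-bin resolution, although a multi-modal distribution would be averaged into a point between its modes.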