Grounding-IQA: Multimodal Language Grounding Model for Image Quality Assessment

📅 2024-11-26

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing multimodal large language model (MLLM)-based image quality assessment (IQA) methods rely mainly on general, holistic descriptions, which limits fine-grained quality assessment. This work introduces grounding-IQA, a new task paradigm that integrates multimodal referring and grounding with IQA. It comprises two subtasks: detailed description with precise locations such as bounding boxes (GIQA-DES) and quality question answering for local regions (GIQA-VQA). The contributions are threefold: (1) the grounding-IQA task paradigm; (2) GIQA-160K, a dataset built with an automated annotation pipeline; and (3) GIQA-Bench, a benchmark that evaluates models from three perspectives: description quality, VQA accuracy, and grounding precision. Experiments indicate that the task paradigm, dataset, and benchmark support more fine-grained IQA.

📝 Abstract

The development of multimodal large language models (MLLMs) enables the evaluation of image quality through natural language descriptions. This advancement allows for more detailed assessments. However, these MLLM-based IQA methods primarily rely on general contextual descriptions, sometimes limiting fine-grained quality assessment. To address this limitation, we introduce a new image quality assessment (IQA) task paradigm, grounding-IQA. This paradigm integrates multimodal referring and grounding with IQA to realize more fine-grained quality perception. Specifically, grounding-IQA comprises two subtasks: grounding-IQA-description (GIQA-DES) and visual question answering (GIQA-VQA). GIQA-DES involves detailed descriptions with precise locations (e.g., bounding boxes), while GIQA-VQA focuses on quality QA for local regions. To realize grounding-IQA, we construct a corresponding dataset, GIQA-160K, through our proposed automated annotation pipeline. Furthermore, we develop a well-designed benchmark, GIQA-Bench. The benchmark comprehensively evaluates a model's grounding-IQA performance from three perspectives: description quality, VQA accuracy, and grounding precision. Experiments demonstrate that our proposed task paradigm, dataset, and benchmark facilitate more fine-grained IQA applications. Code: https://github.com/zhengchen1999/Grounding-IQA.
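
The abstract names grounding precision as one evaluation perspective but does not say how it is measured. As a minimal sketch, assuming locations are axis-aligned bounding boxes in pixel coordinates, the overlap between a predicted box and a reference box can be scored with intersection over union (IoU), a common measure for box localization. The `Box` class and `iou` function below are hypothetical names for illustration only and are not taken from the paper or its code release.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned box: (x1, y1) top-left, (x2, y2) bottom-right."""
    x1: float
    y1: float
    x2: float
    y2: float

    def area(self) -> float:
        # Clamp to zero so that inverted (empty) boxes have no area.
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes; 0.0 when both boxes are empty."""
    # The intersection is itself a box; it is empty when the inputs do not overlap.
    inter = Box(
        max(a.x1, b.x1),
        max(a.y1, b.y1),
        min(a.x2, b.x2),
        min(a.y2, b.y2),
    ).area()
    union = a.area() + b.area() - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    predicted = Box(5, 0, 15, 10)   # assumed model output for a distorted region
    reference = Box(0, 0, 10, 10)   # assumed annotated region
    print(f"IoU = {iou(predicted, reference):.3f}")
```

A higher IoU means the predicted region matches the reference region more closely. Whether GIQA-Bench uses IoU, a thresholded variant, or a different measure is not stated in the abstract.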

Problem

Research questions and friction points this paper is trying to address.

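MLLM-based IQA methods rely mainly on general contextual descriptions, which limits fine-grained quality assessment.

Quality descriptions are not tied to precise locations (e.g., bounding boxes), and quality questions are not posed about local regions.

A dataset and a benchmark for evaluating fine-grained, grounded IQA are needed.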

Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates multimodal referring and grounding with IQA

Introduces the GIQA-DES and GIQA-VQA subtasks

Develops the GIQA-160K dataset and the GIQA-Bench benchmark

🔎 Similar Papers

CLIP-AGIQA: Boosting the Performance of AI-Generated Image Quality Assessment with CLIP