Segmenting and Understanding: Region-aware Semantic Attention for Fine-grained Image Quality Assessment with Large Language Models

📅 2025-08-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing no-reference image quality assessment (NR-IQA) methods either rely on global representations that overlook semantically salient regions or weight all region features uniformly, which reduces sensitivity to localized distortions. To address this, the paper proposes RSFIQA, an NR-IQA framework that uses the Segment Anything Model (SAM) to partition an image into fine-grained, non-overlapping semantic regions. A multimodal large language model (MLLM) then interprets each region's content and distortion semantics, and a Region-Aware Semantic Attention (RSA) mechanism dynamically fuses low-level distortion features with high-level semantic representations. The reported experiments show competitive prediction accuracy and robustness across multiple benchmark datasets. RSFIQA is also backbone-agnostic, so it can be integrated into a range of deep neural network architectures.

📝 Abstract

No-reference image quality assessment (NR-IQA) aims to predict image quality in a way that aligns with subjective human perception. However, existing NR-IQA methods either focus on global representations, which yield limited insight into semantically salient regions, or apply uniform weighting to region features, which weakens sensitivity to local quality variations. In this paper, we propose a fine-grained image quality assessment model, named RSFIQA, which integrates region-level distortion information to perceive multi-dimensional quality discrepancies. To enhance regional quality awareness, we first utilize the Segment Anything Model (SAM) to dynamically partition the input image into non-overlapping semantic regions. For each region, we teach a powerful Multi-modal Large Language Model (MLLM) to extract descriptive content and perceive multi-dimensional distortions, enabling a comprehensive understanding of both local semantics and quality degradations. To effectively leverage this information, we introduce the Region-Aware Semantic Attention (RSA) mechanism, which generates a global attention map by aggregating fine-grained representations from local regions. In addition, RSFIQA is backbone-agnostic and can be seamlessly integrated into various deep neural network architectures. Extensive experiments demonstrate the robustness and effectiveness of the proposed method, which achieves competitive quality prediction performance across multiple benchmark datasets.
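The abstract does not give the RSA formulation, so the following is only a minimal sketch of the general idea of turning per-region information into a global attention map. It assumes each region has already been reduced to a single scalar relevance logit (in RSFIQA this would presumably be derived from the MLLM's regional representations), and the names `global_attention_map` and `attend` are hypothetical, not taken from the paper.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def global_attention_map(label_map, region_logits):
    """Build a pixel-level attention map from per-region logits.

    label_map: 2D list of region ids forming a non-overlapping partition.
    region_logits: dict mapping region id -> scalar logit (assumed given).
    Each region's softmax weight is spread evenly over its pixels,
    so the returned map sums to 1.
    """
    ids = sorted({rid for row in label_map for rid in row})
    weights = dict(zip(ids, softmax([region_logits[i] for i in ids])))
    areas = {i: 0 for i in ids}
    for row in label_map:
        for rid in row:
            areas[rid] += 1
    return [[weights[rid] / areas[rid] for rid in row] for row in label_map]

def attend(feature_map, attention):
    """Attention-weighted sum of a 2D scalar feature map."""
    return sum(
        f * a
        for f_row, a_row in zip(feature_map, attention)
        for f, a in zip(f_row, a_row)
    )

# Two regions of two pixels each; region 1 gets three times the weight of region 0.
labels = [[0, 0], [1, 1]]
attention = global_attention_map(labels, {0: 0.0, 1: math.log(3)})
pooled = attend([[1.0, 1.0], [0.0, 0.0]], attention)
```

In a real model the feature map would be a multi-channel backbone output and the logits would be learned, but the aggregation step has the same shape: region-level weights are broadcast back to pixel positions and used to pool features.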

Problem

Research questions and friction points this paper is trying to address.

Assessing image quality without reference images

Focusing on semantically salient regions in images

Addressing local quality variations in fine-grained assessment
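To make the uniform-weighting problem concrete, here is a small illustrative calculation. The scores and saliency weights are invented for the example and are not data from the paper; the point is only that averaging regions equally can hide a severe distortion confined to one salient region.

```python
def uniform_pool(scores):
    """Average region scores with equal weight per region."""
    return sum(scores) / len(scores)

def weighted_pool(scores, weights):
    """Average region scores using non-negative weights, normalized to sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s * w for s, w in zip(scores, weights)) / total

# Hypothetical per-region quality scores in [0, 1]; the third region is heavily blurred.
region_scores = [0.9, 0.9, 0.2, 0.9]
# Hypothetical saliency weights: the blurred region is the semantically important one.
saliency = [1.0, 1.0, 6.0, 1.0]

uniform = uniform_pool(region_scores)             # about 0.725
weighted = weighted_pool(region_scores, saliency)  # about 0.433
```

Under uniform pooling the image still looks fairly good overall, while the saliency-weighted score drops sharply, which is closer to how a viewer focused on the blurred region would judge it.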

Innovation

Methods, ideas, or system contributions that make the work stand out.

Segment Anything Model for semantic regions

Multi-modal Large Language Model for distortions

Region-Aware Semantic Attention mechanism
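The abstract states that the image is partitioned into non-overlapping semantic regions but does not say how. As one possible approach, assuming the raw segmentation masks may overlap, each pixel can be assigned to the smallest mask that covers it. The function `masks_to_partition` and the set-of-coordinates mask format are hypothetical choices for this sketch, not the paper's procedure or SAM's output format.

```python
def masks_to_partition(masks, height, width, background=-1):
    """Convert possibly overlapping masks into a non-overlapping label map.

    masks: list of sets of (row, col) pixel coordinates, one set per mask.
    Masks are painted from largest to smallest, so a pixel covered by
    several masks ends up labeled with the smallest one.
    Pixels covered by no mask keep the `background` label.
    """
    order = sorted(range(len(masks)), key=lambda i: len(masks[i]), reverse=True)
    label_map = [[background] * width for _ in range(height)]
    for i in order:
        for r, c in masks[i]:
            label_map[r][c] = i
    return label_map

# Mask 0 covers a 2x2 block; mask 1 is a single pixel inside it.
masks = [{(0, 0), (0, 1), (1, 0), (1, 1)}, {(1, 1)}]
partition = masks_to_partition(masks, height=2, width=3)
```

Preferring the smallest mask keeps fine details (an eye inside a face, for instance) as their own regions instead of letting a larger mask absorb them.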
