Q-DeepSight: Incentivizing Thinking with Images for Image Quality Assessment and Refinement

📅 2026-04-18
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing multimodal large language models for image quality assessment rely on single-pass observation and text-only reasoning. Lacking a human-like evidence-seeking mechanism, they yield unreliable feedback that hinders iterative optimization. To overcome this, the authors propose Q-DeepSight, a framework that combines interleaved multimodal Chain-of-Thought (iMCoT) reasoning with tool-augmented visual evidence gathering (such as cropping and zooming) to explicitly localize regions of quality degradation and explain their underlying causes. Two techniques, Perceptual Curriculum Reward (PCR) and Evidence Gradient Filtering (EGF), enable effective reinforcement learning over long trajectories. The work also introduces Perceptual-in-Generation (PiG), a training-free framework in which Q-DeepSight's diagnoses guide iterative image enhancement. Experiments demonstrate state-of-the-art performance across diverse benchmarks, including natural, restored, and AI-generated images, closing the loop between assessment and optimization.
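
To make the crop-and-zoom idea concrete, here is a minimal sketch of what such an evidence-gathering tool could look like. It is an illustration only, not the paper's implementation: the function name `crop_and_zoom`, the `(left, top, right, bottom)` box convention, the grayscale list-of-rows image type, and the nearest-neighbour upscaling are all assumptions made for this example.

```python
from typing import List, Tuple

Image = List[List[int]]  # assumed toy format: grayscale image as rows of intensities


def crop_and_zoom(image: Image, box: Tuple[int, int, int, int], scale: int) -> Image:
    """Crop `box` = (left, top, right, bottom), with right/bottom exclusive,
    then upscale by an integer factor using nearest-neighbour replication."""
    left, top, right, bottom = box
    height, width = len(image), len(image[0])
    if not (0 <= left < right <= width and 0 <= top < bottom <= height):
        raise ValueError("box must lie inside the image")
    if scale < 1:
        raise ValueError("scale must be a positive integer")

    # Step 1: crop the requested region.
    cropped = [row[left:right] for row in image[top:bottom]]

    # Step 2: zoom by repeating every pixel `scale` times in both directions.
    zoomed: Image = []
    for row in cropped:
        wide_row = [pixel for pixel in row for _ in range(scale)]
        zoomed.extend(list(wide_row) for _ in range(scale))
    return zoomed


image = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
]
# Inspect the top-centre 2x2 region at twice the original size.
patch = crop_and_zoom(image, box=(1, 0, 3, 2), scale=2)
```

In a think-with-image setting, the model would presumably choose the box itself and receive the enlarged patch as a new observation before continuing its reasoning. How Q-DeepSight actually selects regions and encodes patches is not specified in this summary.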

📝 Abstract
Image Quality Assessment (IQA) models are increasingly deployed as perceptual critics to guide generative models and image restoration. This role demands not only accurate scores but also actionable, localized feedback. However, current MLLM-based methods adopt a single-look, language-only paradigm, which departs from human evidence-seeking judgment and yields weakly grounded rationales, limiting their reliability for in-the-loop refinement. We propose Q-DeepSight, a think-with-image framework that emulates this human-like process. It performs interleaved Multimodal Chain-of-Thought (iMCoT) with tool-augmented evidence acquisition (e.g., crop-and-zoom) to explicitly determine where quality degrades and why. To train these long iMCoT trajectories via reinforcement learning, we introduce two techniques: Perceptual Curriculum Reward (PCR) to mitigate reward sparsity and Evidence Gradient Filtering (EGF) to improve credit assignment for visually-grounded reasoning. Q-DeepSight achieves state-of-the-art performance across diverse benchmarks, including natural, restored, and AI-generated content. Furthermore, we demonstrate its practical value with Perceptual-in-Generation (PiG), a training-free framework where Q-DeepSight's diagnoses guide iterative image enhancement, effectively closing the loop between assessment and refinement.
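
The closed loop between assessment and refinement described at the end of the abstract can be sketched generically as follows. This is a hedged illustration of an assess-then-refine loop, not the PiG algorithm itself: `refine_in_loop`, `toy_assess`, `toy_refine`, the integer score scale, and the dictionary image stand-in are all hypothetical names and choices introduced here.

```python
from typing import Callable, Dict, List, Tuple

Image = Dict[str, int]             # toy stand-in for real pixel data (assumption)
Diagnosis = Tuple[int, List[str]]  # (quality score, detected issues)


def refine_in_loop(
    image: Image,
    assess: Callable[[Image], Diagnosis],
    refine: Callable[[Image, List[str]], Image],
    target: int,
    max_rounds: int = 5,
) -> Tuple[Image, List[int]]:
    """Alternate assessment and refinement until the score reaches `target`."""
    history: List[int] = []
    for _ in range(max_rounds):
        score, issues = assess(image)
        history.append(score)
        if score >= target or not issues:
            break  # good enough, or nothing actionable left to fix
        image = refine(image, issues)
    return image, history


def toy_assess(image: Image) -> Diagnosis:
    # Hypothetical critic: each unit of blur costs 3 points out of 10.
    blur = image["blur"]
    return 10 - 3 * blur, (["blur"] if blur > 0 else [])


def toy_refine(image: Image, issues: List[str]) -> Image:
    # Hypothetical enhancer: removes one unit of blur per round.
    fixed = dict(image)
    if "blur" in issues:
        fixed["blur"] -= 1
    return fixed


final_image, scores = refine_in_loop({"blur": 2}, toy_assess, toy_refine, target=8)
```

No model is trained inside the loop, which matches the "training-free" description: the critic only supplies a score and a list of issues, and the enhancer acts on them. In practice the critic would be the assessment model and the enhancer a restoration or generation model; both are replaced by toy functions here.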
Problem

Research questions and friction points this paper is trying to address.

Image Quality Assessment
Multimodal Large Language Models
Actionable Feedback
Visual Evidence
In-the-loop Refinement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Chain-of-Thought
Tool-augmented Reasoning
Perceptual Curriculum Reward
Evidence Gradient Filtering
Image Quality Assessment
Authors
Xudong Li
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China
Jiaxi Tan
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China
Ziyin Zhou
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China
Yan Zhong
Peking University
Machine Learning, Deep Learning, Computer Vision, Data Mining, Large Language Models
Zihao Huang
Beijing Institute of Technology
Jingyuan Zheng
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China
Yan Zhang
Xiamen University
Statistics
Xiawu Zheng
Associate Professor, IEEE Senior Member, Xiamen University
Automated Machine Learning, Network Compression, Neural Architecture Search, AutoML
Rongrong Ji
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China