Investigate the Low-level Visual Perception in Vision-Language based Image Quality Assessment

📅 2025-12-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language multimodal large language models (MLLMs) generate textual explanations for image quality assessment (IQA), but they are unreliable at detecting low-level distortions—such as blur, noise, and compression—and their reasoning is often inconsistent. This reveals a critical weakness in their vision encoders: insufficient preservation of low-level visual features during vision–language alignment. Method: We first formulate a low-level distortion-aware classification task, quantifying alignment-induced distortion recognition bias via component-level fine-grained analysis and semantic distance measurement. We then introduce an alignment-constrained mechanism and establish a novel multimodal distortion classification benchmark. Contribution/Results: Experiments demonstrate that our approach raises distortion-type classification accuracy from 14.92% to 84.43%, providing the first systematic validation that strengthening vision–language alignment in the encoder simultaneously enhances low-level distortion perception and textual explanation consistency.

📝 Abstract
Recent advances in Image Quality Assessment (IQA) have leveraged Multi-modal Large Language Models (MLLMs) to generate descriptive explanations. However, despite their strong visual perception modules, these models often fail to reliably detect basic low-level distortions such as blur, noise, and compression, and may produce inconsistent evaluations across repeated inferences. This raises an essential question: do MLLM-based IQA systems truly perceive the visual features that matter? To examine this issue, we introduce a low-level distortion perception task that requires models to classify specific distortion types. Our component-wise analysis shows that although MLLMs are structurally capable of representing such distortions, they tend to overfit training templates, leading to biases in quality scoring. As a result, critical low-level features are weakened or lost during the vision-language alignment transfer stage. Furthermore, by computing the semantic distance between visual features and corresponding semantic tokens before and after component-wise fine-tuning, we show that improving the alignment of the vision encoder dramatically enhances distortion recognition accuracy, increasing it from 14.92% to 84.43%. Overall, these findings indicate that incorporating dedicated constraints on the vision encoder can strengthen text-explainable visual representations and enable MLLM-based pipelines to produce more coherent and interpretable reasoning in vision-centric tasks.
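The abstract describes computing the semantic distance between visual features and their corresponding semantic tokens before and after component-wise fine-tuning. A minimal sketch of one plausible formulation—mean cosine distance over matched feature/token pairs—is shown below; the function name, shapes, and metric are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def semantic_distance(visual_feats, token_embeds):
    """Mean cosine distance between each visual feature and its
    corresponding semantic token embedding (illustrative metric only)."""
    v = visual_feats / np.linalg.norm(visual_feats, axis=-1, keepdims=True)
    t = token_embeds / np.linalg.norm(token_embeds, axis=-1, keepdims=True)
    # Cosine similarity per matched (visual, token) pair, averaged as a distance.
    sims = np.sum(v * t, axis=-1)
    return float(np.mean(1.0 - sims))

# Toy check: features close to their tokens yield a smaller distance
# than unrelated random features, mirroring better alignment.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))
aligned = tokens + 0.1 * rng.normal(size=(4, 8))
random_feats = rng.normal(size=(4, 8))
```

Under this formulation, a drop in the distance after fine-tuning the vision encoder would indicate tighter vision–language alignment.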
Problem

Research questions and friction points this paper is trying to address.

Examines MLLMs' failure to detect low-level image distortions like blur and noise
Investigates inconsistent evaluations and biases in MLLM-based quality scoring
Proposes enhancing vision-language alignment to improve distortion recognition accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Component-wise fine-tuning enhances vision-language alignment
Dedicated constraints on vision encoder improve distortion recognition
Strengthened visual representations enable coherent MLLM-based reasoning
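The reported jump from 14.92% to 84.43% refers to distortion-type classification accuracy. A minimal sketch of how such an evaluation could be scored, with a per-class breakdown in the spirit of the paper's component-wise analysis (label set and predictions are hypothetical):

```python
from collections import Counter

# Hypothetical distortion label set; the paper's benchmark may differ.
DISTORTIONS = ["blur", "noise", "compression"]

def classification_accuracy(predictions, labels):
    """Overall fraction of distortion-type predictions matching ground truth."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def per_class_recall(predictions, labels):
    """Recall per distortion type, exposing which distortions a model misses."""
    total = Counter(labels)
    correct = Counter(y for p, y in zip(predictions, labels) if p == y)
    return {c: correct[c] / total[c] for c in total}

# Toy predictions from a model on four distorted images.
gold  = ["blur", "noise", "noise", "compression"]
preds = ["blur", "noise", "blur", "compression"]
```

A per-class breakdown like this would reveal whether accuracy gains come uniformly across distortion types or from a subset of them.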
Yuan Li
Graduate School of Informatics, Kyoto University, Kyoto, 606-8501, Japan
Zitang Sun
Graduate School of Informatics, Kyoto University, Kyoto, 606-8501, Japan
Yen-Ju Chen
Graduate School of Informatics, Kyoto University, Kyoto, 606-8501, Japan
Shin'ya Nishida
Kyoto University, NTT
vision, perception