Building Reasonable Inference for Vision-Language Models in Blind Image Quality Assessment

📅 2025-12-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses two critical limitations of vision-language models (VLMs) in blind image quality assessment (BIQA): *unreasonable inference*—where textual descriptions contradict predicted quality scores—and *unstable prediction*—characterized by large fluctuations in quality scores across reasoning steps. We identify the root causes as weak causal relationships between visual features and quality judgments, and token-level preference bias in the decoder. To address these, we propose a novel two-stage decoupled fine-tuning paradigm that explicitly separates visual perception from quality reasoning for the first time, enabling human-aligned, interpretable, and stable inference. Our method incorporates multi-layer feature decoding analysis, vision–semantics decoupled training, and stability quantification via prediction volatility. Evaluations on SPAQ and KONIQ show a 9.61-percentage-point reduction in instability rate; average Spearman rank correlation coefficient (SRCC) and Pearson linear correlation coefficient (PLCC) improve by 0.3124 and 0.3507, respectively, across LIVE, CSIQ, SPAQ, and KONIQ.

📝 Abstract
Recent progress in BIQA has been driven by VLMs, whose semantic reasoning abilities suggest that they might extract visual features, generate descriptive text, and infer quality in a human-like manner. However, these models often produce textual descriptions that contradict their final quality predictions, and the predicted scores can change unstably during inference, behaviors not aligned with human reasoning. To understand these issues, we analyze the factors that cause contradictory assessments and instability. We first estimate the relationship between the final quality predictions and the generated visual features, finding that the predictions are not fully grounded in the features and that the logical connection between them is weak. Moreover, decoding intermediate VLM layers shows that the model frequently relies on a limited set of candidate tokens, which contributes to prediction instability. To encourage more human-like reasoning, we introduce a two-stage tuning method that explicitly separates visual perception from quality inference. In the first stage, the model learns visual features; in the second, it infers quality solely from these features. Experiments on SPAQ and KONIQ demonstrate that our approach reduces prediction instability from 22.00% to 12.39% and achieves average gains of 0.3124/0.3507 in SRCC/PLCC across LIVE, CSIQ, SPAQ, and KONIQ compared to the baseline. Further analyses show that our method improves both stability and the reliability of the inference process.
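The reported gains are in SRCC and PLCC, the standard IQA agreement metrics between predicted scores and mean opinion scores (MOS). As a quick reference, both can be computed with SciPy; the arrays below are illustrative, not the paper's data:

```python
from scipy.stats import spearmanr, pearsonr

# Illustrative predicted quality scores and ground-truth MOS (not the paper's data).
predicted = [3.1, 4.5, 2.2, 4.9, 1.8]
mos       = [3.0, 4.6, 2.5, 4.2, 1.5]

srcc, _ = spearmanr(predicted, mos)  # rank (monotonic) agreement
plcc, _ = pearsonr(predicted, mos)   # linear agreement

print(f"SRCC={srcc:.4f}, PLCC={plcc:.4f}")
```

SRCC measures whether the model ranks images in the same order as human raters, while PLCC measures linear fit of the scores themselves; BIQA papers conventionally report both.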
Problem

Research questions and friction points this paper is trying to address.

Vision-language models produce contradictory text and quality predictions in BIQA
Quality predictions lack grounding in visual features and show logical inconsistency
Models exhibit unstable inference due to limited token usage during reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage tuning separates visual perception from quality inference
Model learns visual features first, then infers quality solely from those features
Reduces prediction instability and improves reliability of inference process
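The decoupling idea can be sketched numerically. The toy below is a minimal illustration, not the authors' implementation: stage 1 fits a "perception" mapping from raw inputs to visual-feature targets, then stage 2 freezes that mapping and fits a quality scorer solely on the learned features, never on the raw inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: raw flattened "images", visual-feature targets, and MOS that
# depends only on the features (all synthetic, for illustration).
X = rng.normal(size=(100, 16))                     # raw inputs
F_target = X @ rng.normal(size=(16, 4))            # stage-1 supervision
mos = F_target @ np.array([0.5, -0.2, 0.1, 0.3])   # quality from features only

# Stage 1: learn visual perception (inputs -> feature descriptions).
W1, *_ = np.linalg.lstsq(X, F_target, rcond=None)

# Stage 2: freeze stage-1 weights; infer quality solely from learned features.
F = X @ W1
w2, *_ = np.linalg.lstsq(F, mos, rcond=None)

pred = (X @ W1) @ w2
print("max abs error:", np.max(np.abs(pred - mos)))
```

Because the scorer only ever sees stage-1 features, its predictions cannot bypass the perceptual representation, which is the grounding property the paper argues end-to-end VLM tuning lacks.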
Yuan Li
Graduate School of Informatics, Kyoto University, Kyoto, 606-8501, Japan
Zitang Sun
Graduate School of Informatics, Kyoto University, Kyoto, 606-8501, Japan
Yen-ju Chen
Graduate School of Informatics, Kyoto University, Kyoto, 606-8501, Japan
Shin'ya Nishida
Kyoto University, NTT