Building Reasonable Inference for Vision-Language Models in Blind Image Quality Assessment

📅 2025-12-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses two critical limitations of vision-language models (VLMs) in blind image quality assessment (BIQA): *unreasonable inference*—where textual descriptions contradict predicted quality scores—and *unstable prediction*—characterized by large fluctuations in quality scores across reasoning steps. We identify the root causes as weak causal relationships between visual features and quality judgments, and token-level preference bias in the decoder. To address these, we propose a novel two-stage decoupled fine-tuning paradigm that explicitly separates visual perception from quality reasoning for the first time, enabling human-aligned, interpretable, and stable inference. Our method incorporates multi-layer feature decoding analysis, vision–semantics decoupled training, and stability quantification via prediction volatility. Evaluations on SPAQ and KONIQ show a 9.61-percentage-point reduction in instability rate; average Spearman rank correlation coefficient (SRCC) and Pearson linear correlation coefficient (PLCC) improve by 0.3124 and 0.3507, respectively, across LIVE, CSIQ, SPAQ, and KONIQ.

📝 Abstract
Recent progress in BIQA has been driven by VLMs, whose semantic reasoning abilities suggest that they might extract visual features, generate descriptive text, and infer quality in a human-like manner. However, these models often produce textual descriptions that contradict their final quality predictions, and the predicted scores can change unstably during inference, behaviors not aligned with human reasoning. To understand these issues, we analyze the factors that cause contradictory assessments and instability. We first estimate the relationship between the final quality predictions and the generated visual features, finding that the predictions are not fully grounded in the features and that the logical connection between them is weak. Moreover, decoding intermediate VLM layers shows that the model frequently relies on a limited set of candidate tokens, which contributes to prediction instability. To encourage more human-like reasoning, we introduce a two-stage tuning method that explicitly separates visual perception from quality inference. In the first stage, the model learns visual features; in the second, it infers quality solely from these features. Experiments on SPAQ and KONIQ demonstrate that our approach reduces prediction instability from 22.00% to 12.39% and achieves average gains of 0.3124/0.3507 in SRCC/PLCC across LIVE, CSIQ, SPAQ, and KONIQ compared to the baseline. Further analyses show that our method improves both stability and the reliability of the inference process.
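The reported gains are in SRCC and PLCC, the standard IQA agreement metrics between predicted scores and mean opinion scores (MOS). As a quick reference, both can be computed with SciPy; the arrays below are illustrative, not the paper's data:

```python
from scipy.stats import spearmanr, pearsonr

# Illustrative predicted quality scores and ground-truth MOS (not the paper's data).
predicted = [3.1, 4.5, 2.2, 4.9, 1.8]
mos       = [3.0, 4.6, 2.5, 4.2, 1.5]

srcc, _ = spearmanr(predicted, mos)  # rank (monotonic) agreement
plcc, _ = pearsonr(predicted, mos)   # linear agreement

print(f"SRCC={srcc:.4f}, PLCC={plcc:.4f}")
```

SRCC measures whether the model ranks images in the same order as human raters, while PLCC measures linear fit of the scores themselves; BIQA papers conventionally report both.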
Problem

Research questions and friction points this paper is trying to address.

Vision-language models produce contradictory text and quality predictions in BIQA
Quality predictions lack grounding in visual features and show logical inconsistency
Models exhibit unstable inference due to limited token usage during reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage tuning separates visual perception from quality inference
Model learns visual features first, then infers quality solely from those features
Reduces prediction instability and improves reliability of inference process
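The decoupling idea can be sketched numerically. The toy below is a minimal illustration, not the authors' implementation: stage 1 fits a "perception" mapping from raw inputs to visual-feature targets, then stage 2 freezes that mapping and fits a quality scorer solely on the learned features, never on the raw inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: raw flattened "images", visual-feature targets, and MOS that
# depends only on the features (all synthetic, for illustration).
X = rng.normal(size=(100, 16))                     # raw inputs
F_target = X @ rng.normal(size=(16, 4))            # stage-1 supervision
mos = F_target @ np.array([0.5, -0.2, 0.1, 0.3])   # quality from features only

# Stage 1: learn visual perception (inputs -> feature descriptions).
W1, *_ = np.linalg.lstsq(X, F_target, rcond=None)

# Stage 2: freeze stage-1 weights; infer quality solely from learned features.
F = X @ W1
w2, *_ = np.linalg.lstsq(F, mos, rcond=None)

pred = (X @ W1) @ w2
print("max abs error:", np.max(np.abs(pred - mos)))
```

Because the scorer only ever sees stage-1 features, its predictions cannot bypass the perceptual representation, which is the grounding property the paper argues end-to-end VLM tuning lacks.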
Yuan Li
Graduate School of Informatics, Kyoto University, Kyoto, 606-8501, Japan
Zitang Sun
Graduate School of Informatics, Kyoto University, Kyoto, 606-8501, Japan
Yen-ju Chen
Graduate School of Informatics, Kyoto University, Kyoto, 606-8501, Japan
Shin'ya Nishida
Kyoto University, NTT