🤖 AI Summary
Existing methods for evaluating infrared-visible image fusion (IVIF) quality rely either on handcrafted no-reference metrics or on full-reference metrics that treat the source images as pseudo ground truth; neither reliably captures human visual preferences. This work proposes FuScore, a framework that applies a multimodal large language model (MLLM) to this task and uses distributional supervision to produce continuous quality scores rather than discrete rating levels, which allows fine-grained differentiation among fusion results of similar quality. Per-image soft labels are built from the agreement among four IVIF-specific sub-dimensions, so that their sharpness reflects how consistent the human judgment is. Training uses a three-part objective: per-image distributional supervision, a within-source-pair Thurstone fidelity term for method-level ranking, and a cross-source-pair Thurstone fidelity term for scene-level ranking. Experiments show that FuScore achieves state-of-the-art correlation with human visual preferences.
📝 Abstract
Infrared-visible image fusion (IVIF) aims to integrate thermal information and detailed spatial structures into a single fused image to enhance perception. However, existing evaluation approaches lean heavily on hand-crafted no-reference statistics and on full-reference metrics that treat the source images as pseudo ground truths, both of which are prone to over-optimization. Recent IVIF reward-modelling efforts learn from human ratings but use scalar regression on aggregated scores, so they neither leverage the reasoning of Multimodal Large Language Models (MLLMs) nor encode per-image perceptual ambiguity in their supervision. Naively introducing MLLMs with discrete one-hot supervision does not solve the problem either, because it can push fused images of similar quality into different rating levels. To address this, we introduce FuScore, which uses an MLLM to mimic human visual perception by producing a continuous quality score rather than discrete level predictions, enabling fine-grained discrimination among fused images of similar quality. We exploit the agreement among four IVIF-specific sub-dimensions to construct a per-image soft label whose sharpness reflects how consensual the overall judgment is. We further introduce a tripartite objective that combines per-image distributional supervision, within-source-pair Thurstone fidelity for method-level ordering, and cross-source-pair Thurstone fidelity for scene-level ordering. Extensive experiments demonstrate that FuScore achieves state-of-the-art correlation with human visual preferences.
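The abstract does not give formulas, so the following is only a minimal sketch of how the ingredients it names could fit together, not the authors' implementation. Everything specific here is an assumption made for illustration: the five rating levels, the Gaussian shape of the soft label, the `min_std` floor, the Gaussian form of the Thurstone comparison, and all function names.

```python
import math

# Assumption: five discrete rating levels, as in common 1-5 opinion scales.
LEVELS = (1.0, 2.0, 3.0, 4.0, 5.0)


def soft_label(sub_scores, levels=LEVELS, min_std=0.25):
    """Build a soft label over rating levels from sub-dimension scores.

    Hypothetical construction: a discretized Gaussian centered on the mean
    sub-dimension score, whose width is the spread of the sub-dimension
    scores. High agreement gives a sharp label, disagreement a flat one.
    """
    mean = sum(sub_scores) / len(sub_scores)
    var = sum((s - mean) ** 2 for s in sub_scores) / len(sub_scores)
    std = max(math.sqrt(var), min_std)  # floor keeps the label well defined
    weights = [math.exp(-0.5 * ((lv - mean) / std) ** 2) for lv in levels]
    total = sum(weights)
    return [w / total for w in weights]


def expected_score(probs, levels=LEVELS):
    """Continuous quality score: the mean of the level distribution."""
    return sum(p * lv for p, lv in zip(probs, levels))


def score_variance(probs, levels=LEVELS):
    """Variance of the level distribution around its mean."""
    mu = expected_score(probs, levels)
    return sum(p * (lv - mu) ** 2 for p, lv in zip(probs, levels))


def thurstone_preference(mu_a, var_a, mu_b, var_b, eps=1e-8):
    """Probability that A is preferred over B under a Gaussian Thurstone model."""
    z = (mu_a - mu_b) / math.sqrt(var_a + var_b + eps)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF


def fidelity_loss(p_pred, p_true):
    """Fidelity loss between two binary preference probabilities.

    Zero when the probabilities match; grows as they diverge.
    """
    return (1.0
            - math.sqrt(p_pred * p_true)
            - math.sqrt((1.0 - p_pred) * (1.0 - p_true)))


if __name__ == "__main__":
    # Two fused images with hypothetical sub-dimension ratings.
    label_a = soft_label([4, 4, 4, 4])   # raters agree: sharp label
    label_b = soft_label([2, 3, 4, 5])   # raters disagree: flat label
    p_true = thurstone_preference(
        expected_score(label_a), score_variance(label_a),
        expected_score(label_b), score_variance(label_b),
    )
    # Pretend the model predicted exactly these distributions.
    print("P(A preferred over B):", round(p_true, 3))
    print("fidelity loss at a perfect prediction:", fidelity_loss(p_true, p_true))
```

In this sketch the same pairwise term would serve both ranking objectives: pairs drawn from one source pair compare fusion methods, while pairs drawn from different source pairs compare scenes. How the three terms are weighted is not stated in the abstract and is left out here.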