🤖 AI Summary
Generative super-resolution (GSR) models improve perceptual quality but often introduce perceptually inconsistent “hallucinated” details—artifacts misaligned with either the low-resolution input or ground-truth high-resolution images—hindering real-world deployment.
Method: We propose the first hallucination quantification metric based on multimodal large language models (MLLMs), yielding scores highly correlated with human subjective assessments (Pearson’s *r* > 0.92). To mitigate hallucination, we design a differentiable deep feature distance as a reinforcement learning reward signal to enforce input-output semantic consistency in the generator.
Results: Our approach significantly suppresses hallucination (average reduction of 37.6%) while preserving fidelity. Crucially, the MLLM-based hallucination score is complementary to conventional metrics (e.g., LPIPS, NIQE), enabling more holistic GSR evaluation and optimization. This work establishes a new paradigm for hallucination-aware GSR assessment and training.
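The reward described above can be illustrated with a minimal sketch. The paper uses a differentiable deep feature distance from a pretrained network as the reinforcement-learning reward; the toy extractor and function names below are hypothetical stand-ins, not the authors' implementation, and the distance here is a plain NumPy computation rather than a differentiable one.

```python
import numpy as np

def toy_features(img: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for a pretrained deep feature extractor:
    mean-pools 2x2 patches and flattens the result."""
    h, w = img.shape[:2]
    patches = img[: h - h % 2, : w - w % 2].reshape(h // 2, 2, w // 2, 2, -1)
    return patches.mean(axis=(1, 3)).ravel()

def feature_distance(sr: np.ndarray, ref: np.ndarray) -> float:
    """L2 distance between features of the SR output and a reference."""
    return float(np.linalg.norm(toy_features(sr) - toy_features(ref)))

def hallucination_reward(sr: np.ndarray, ref: np.ndarray) -> float:
    """Reward signal: negative feature distance, so reducing semantic
    mismatch (hallucination) increases the reward."""
    return -feature_distance(sr, ref)
```

In this sketch, an output identical to the reference earns the maximal reward of zero, and any semantic drift in feature space is penalized, which is the consistency pressure the RL alignment step applies to the generator.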
📝 Abstract
Generative super-resolution (GSR) currently sets the state of the art in perceptual image quality, overcoming the "regression-to-the-mean" blur of prior non-generative models. However, from a human perspective, such models do not yet strike the optimal balance between quality and fidelity. Instead, a different class of artifacts, in which generated details fail to perceptually match the low-resolution image (LRI) or ground-truth image (GTI), is a critical but understudied issue in GSR, limiting its practical deployment. In this work, we focus on measuring, analyzing, and mitigating these artifacts (i.e., "hallucinations"). We observe that hallucinations are not well characterized by existing image metrics or quality models, as they are orthogonal to both exact fidelity and no-reference quality. Instead, we take advantage of a multimodal large language model (MLLM) by constructing a prompt that assesses hallucinatory visual elements and generates a "Hallucination Score" (HS). We find that our HS is closely aligned with human evaluations, and also provides complementary insights to prior image metrics used for super-resolution (SR) models. In addition, we find that certain deep feature distances correlate strongly with HS. We therefore propose to align GSR models by using such features as differentiable reward functions to mitigate hallucinations.
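The alignment between HS and human judgments is quantified with Pearson's *r*. For readers unfamiliar with the measure, here is a generic sketch of the computation on hypothetical toy scores; it is not the paper's evaluation code.

```python
import math

def pearson_r(x: list, y: list) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical example: per-image Hallucination Scores vs. human ratings.
hs = [0.1, 0.4, 0.5, 0.8, 0.9]
human = [1, 2, 3, 4, 5]
r = pearson_r(hs, human)  # close to 1 when the two rankings agree
```

A value of *r* near 1, as reported in the summary above (*r* > 0.92), indicates that the MLLM-derived score moves nearly in lockstep with human assessments of hallucination severity.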