🤖 AI Summary
This work addresses hallucination in large vision-language models (LVLMs) and the inadequacy of existing self-evaluation methods, which rely heavily on linguistic priors and fail to assess prediction reliability grounded in visual evidence. To this end, the authors propose a training-free, vision-aware uncertainty quantification framework that measures how strongly a model's output depends on critical visual regions, using an Image Information Score (IS) together with an unsupervised salient-region masking strategy. Combining predictive entropy with the IS derived from masked images yields a self-evaluation function that reflects answer correctness without any additional training. Extensive experiments show that this approach significantly outperforms current techniques across multiple benchmarks, improving the safety and reliability of LVLM deployment.
📝 Abstract
Large Vision-Language Models (LVLMs) frequently hallucinate, limiting their safe deployment in real-world applications. Existing LLM self-evaluation methods rely on a model's ability to estimate the correctness of its own outputs, which can improve deployment reliability; however, they depend heavily on language priors and are therefore ill-suited for evaluating vision-conditioned predictions. We propose VAUQ, a vision-aware uncertainty quantification framework for LVLM self-evaluation that explicitly measures how strongly a model's output depends on visual evidence. VAUQ introduces the Image-Information Score (IS), which captures the reduction in predictive uncertainty attributable to visual input, and an unsupervised core-region masking strategy that amplifies the influence of salient regions. Combining predictive entropy with this core-masked IS yields a training-free scoring function that reliably reflects answer correctness. Comprehensive experiments show that VAUQ consistently outperforms existing self-evaluation methods across multiple datasets.