🤖 AI Summary
Large Vision-Language Models (LVLMs) are prone to hallucination in multimodal tasks, which hinders their practical deployment. This work proposes a training-free framework that mitigates hallucination by leveraging an explicit visual grounding agent to extract structured visual evidence and by introducing an interpretable evidence-verification mechanism for iterative self-refinement. The approach reduces hallucinatory outputs without over-correction and provides transparent diagnostic traces for model decisions. Evaluated on the POPE and MME-Hallucination benchmarks, the method outperforms strong baselines by 3.31% and 28.34 points, respectively. Ablation studies further show that the self-refinement module and the grounding agent each contribute an average gain of 2.0% on POPE.
📝 Abstract
Large vision-language models (LVLMs) have become increasingly capable but remain prone to hallucinations in multimodal tasks, which significantly limits their deployment. As training LVLMs to avoid hallucinations becomes prohibitively expensive at larger scales, training-free methods offer a cheap and flexible solution to this problem, yet existing approaches based on decoding or tool use often bring limited gains and/or weak interpretability. We propose Kestrel, a training-free framework for LVLM hallucination mitigation that combines an explicit visual-grounding agent with an evidence-verified self-refinement mechanism. In detail, Kestrel first collects explicit visual evidence, converting tool outputs into reusable, structured textual evidence. Second, to take full advantage of this evidence, Kestrel verifies it with an LVLM judge, then iteratively self-refines answers based on the verified evidence to reduce the risk of over-correction. Extensive experiments show that Kestrel improves performance over strong baselines across hallucination benchmarks (e.g., average +3.31% on POPE and +28.34 on MME-Hallucination with Qwen3-VL), while providing transparent verification traces for hallucination diagnosis and analysis -- e.g., the integrated self-refinement module and the grounding agent each contribute an average +2.0% gain on POPE.
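The grounding-verify-refine loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, evidence format, and toy judge/refine logic are all hypothetical stand-ins for real tool calls and LVLM queries.

```python
# Hypothetical sketch of a Kestrel-style loop:
#   1) a grounding agent converts tool outputs into structured textual evidence,
#   2) an LVLM judge verifies each evidence item against the image,
#   3) the answer is iteratively refined using only verified evidence,
#      stopping at convergence to avoid over-correction.
from dataclasses import dataclass

@dataclass
class Evidence:
    claim: str            # structured textual evidence, e.g. "object: dog present"
    verified: bool = False

def grounding_agent(image, question):
    """Stand-in for tool calls (detector, OCR, ...) emitting textual evidence."""
    return [Evidence("object: dog present"), Evidence("object: frisbee present")]

def judge(image, ev):
    """Stand-in for an LVLM judge that checks one evidence item."""
    return "dog" in ev.claim  # toy rule purely for this sketch

def refine(answer, evidence):
    """Revise the answer using only verified evidence (toy string logic)."""
    if "[grounded on:" in answer:          # idempotent: already refined
        return answer
    facts = "; ".join(ev.claim for ev in evidence if ev.verified)
    return f"{answer} [grounded on: {facts}]" if facts else answer

def kestrel_loop(image, question, draft_answer, max_iters=3):
    evidence = grounding_agent(image, question)
    answer = draft_answer
    for _ in range(max_iters):
        for ev in evidence:
            ev.verified = judge(image, ev)   # evidence-verification step
        refined = refine(answer, evidence)   # refine on verified evidence only
        if refined == answer:                # converged -> stop early
            break
        answer = refined
    return answer, evidence                  # evidence doubles as a diagnostic trace
```

Returning the verified evidence list alongside the answer mirrors the "transparent verification traces" the abstract highlights: each item records what was checked and whether it passed.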