🤖 AI Summary
This work addresses the prevalent issue of hallucinations in large vision-language models, where generated outputs often contradict visual inputs. Existing mitigation strategies typically incur high computational overhead or suffer from inefficient inference. To overcome these limitations, the authors propose a dynamic, training-free framework that operates during inference without requiring dual-path architectures. By applying lightweight interventions to intermediate representations in real time, the method detects and edits hallucinatory content on the fly. This approach achieves effective and controllable hallucination suppression at minimal additional computational cost. Extensive experiments demonstrate state-of-the-art performance across multiple benchmarks, significantly enhancing model robustness and practical applicability.
📝 Abstract
Large Vision-Language Models have demonstrated exceptional performance in multimodal reasoning and complex scene understanding. However, these models still suffer from significant hallucination issues, where outputs contradict visual facts. Recent research on hallucination mitigation has focused on retraining methods and Contrastive Decoding (CD) methods. While both perform well, retraining methods require substantial training resources, and CD methods introduce the overhead of dual inference passes. These factors hinder their practical applicability. To address these issues, we propose a framework that dynamically detects hallucination representations and performs hallucination-eliminating edits on them. With minimal additional computational cost, we achieve state-of-the-art performance on existing benchmarks. Extensive experiments demonstrate the effectiveness of our approach, highlighting its efficient and robust hallucination elimination capability and its fine-grained controllability over hallucinations. Code is available at https://github.com/ASGO-MM/HIRE
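The detect-then-edit idea described above can be illustrated with a minimal sketch. The abstract does not specify the detection or editing operators, so everything below is an assumption for illustration: we treat detection as measuring a hidden state's alignment with a hypothetical "hallucination direction" probe, and editing as projecting that component out when it exceeds a threshold.

```python
import numpy as np

def edit_hidden_state(h, hallucination_dir, threshold=0.5, alpha=1.0):
    """Hypothetical detect-then-edit intervention on one hidden state.

    h: a hidden-state vector from an intermediate layer.
    hallucination_dir: an assumed probe direction (not from the paper).
    threshold: detection cutoff on the alignment score.
    alpha: edit strength; 1.0 fully removes the component along the probe.
    """
    d = hallucination_dir / np.linalg.norm(hallucination_dir)
    score = float(h @ d)          # detection: alignment with the probe
    if abs(score) > threshold:    # edit only representations flagged as hallucinatory
        h = h - alpha * score * d # remove the component along the probe direction
    return h

# A flagged state has its hallucination component removed...
h_edited = edit_hidden_state(np.array([3.0, 1.0, 0.0]),
                             np.array([1.0, 0.0, 0.0]))
# ...while a state below the threshold passes through unchanged.
h_kept = edit_hidden_state(np.array([0.1, 1.0, 0.0]),
                           np.array([1.0, 0.0, 0.0]))
```

Because the intervention is a single thresholded projection per hidden state, it adds negligible cost per token, which is consistent with the abstract's claim of avoiding dual inference passes; the `alpha` strength parameter sketches one way the claimed controllability over hallucinations could be exposed.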