AI Summary
This work addresses the prevalent issue of object hallucination in large vision-language models (LVLMs), which undermines the reliability of generated outputs. The authors propose a novel training strategy that classifies generated tokens based on their degree of dependence on the input image into three categories: positively correlated, invariant, and negatively correlated. Leveraging this classification, the method dynamically adjusts token-level training weights and integrates a hallucination-aware data filtering mechanism. Notably, this approach introduces visual dependency into the training weighting scheme for the first time, effectively suppressing object hallucinations across three LVLM variants without incurring additional inference overhead. Experimental results demonstrate a significant reduction in hallucination rates while preserving response length and computational efficiency, confirming the method's effectiveness and generalizability.
Abstract
Object hallucination is a significant challenge that hinders the application of large vision-language models (LVLMs) in practice. We hypothesize that one possible origin of hallucination is the model's tendency to prioritize text generation over meaningful interaction with images. To explore this, we examine the generation process and categorize text tokens into three groups: image-positive, invariant, and negative, based on their visual dependence on input image tokens. Our analysis reveals that most generated tokens are minimally influenced by the image information. This suggests that during the model's training stage, more emphasis is placed on learning how to follow textual instructions, rather than extracting information from images. Based on this finding, we propose adjusting the training weights of different tokens depending on their visual dependence to control hallucination. Additionally, we remove a portion of the training data that potentially contains more hallucinations as a data filtering strategy. Both methods achieve a reduction in hallucination without compromising response length or introducing additional computational costs during inference. We validate our methods across three LVLM variants, demonstrating the effectiveness and general applicability.
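The token-weighting idea can be illustrated with a minimal sketch. The abstract does not specify how visual dependence is measured or which weights are used, so everything concrete here is an assumption made for illustration: dependence is approximated as the change in a token's log-probability when the image is present versus absent, and the threshold `tau`, the function names, and the weight values in `WEIGHTS` are all hypothetical.

```python
from typing import Dict, List

# Hypothetical per-category loss weights (not taken from the paper).
WEIGHTS: Dict[str, float] = {"positive": 1.5, "invariant": 1.0, "negative": 0.5}


def classify_tokens(
    lp_with_image: List[float],
    lp_without_image: List[float],
    tau: float = 0.5,
) -> List[str]:
    """Label each token by how its log-probability changes when the image is given.

    Assumption: visual dependence = log p(token | image, text) - log p(token | text).
    """
    labels = []
    for lp_img, lp_txt in zip(lp_with_image, lp_without_image):
        delta = lp_img - lp_txt
        if delta > tau:
            labels.append("positive")   # the image makes the token more likely
        elif delta < -tau:
            labels.append("negative")   # the image makes the token less likely
        else:
            labels.append("invariant")  # the image barely changes the token
    return labels


def weighted_nll(
    lp_with_image: List[float],
    labels: List[str],
    weights: Dict[str, float] = WEIGHTS,
) -> float:
    """Weight-normalized negative log-likelihood over the target tokens."""
    total_weight = sum(weights[label] for label in labels)
    loss = sum(weights[label] * -lp for lp, label in zip(lp_with_image, labels))
    return loss / total_weight


# Toy log-probabilities for three target tokens (made-up numbers).
lp_img = [-0.1, -1.0, -3.0]
lp_txt = [-2.0, -1.1, -0.5]
labels = classify_tokens(lp_img, lp_txt)
loss = weighted_nll(lp_img, labels)
```

In this sketch the weighting only changes the training loss, which is consistent with the abstract's claim that no additional cost is introduced at inference time.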