Mitigating Action-Relation Hallucinations in LVLMs via Relation-aware Visual Enhancement

📅 2026-05-12
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Large vision-language models (LVLMs) frequently hallucinate action relations, primarily because they allocate insufficient attention to visual information. This work proposes Relation-aware Visual Enhancement (RVE), a method that uses an Action-Relation Sensitivity (ARS) score to identify the attention heads most sensitive to action-relation changes, localizes the action-relevant image regions through those heads, and amplifies the model's attention to those regions. RVE can be integrated into existing LVLM architectures and outperforms existing baselines at mitigating action-relation hallucinations with negligible additional inference cost. It also generalizes to spatial-relation hallucinations and object hallucinations.
πŸ“ Abstract
Large Vision-Language Models (LVLMs) have achieved remarkable performance on diverse vision-language tasks. However, LVLMs still suffer from hallucinations, generating text that contradicts the visual input. Existing research has primarily focused on mitigating object hallucinations, but often overlooks more complex relation hallucinations, particularly action relations involving interactions between objects. In this study, we empirically observe that the primary cause of action-relation hallucinations in LVLMs is the insufficient attention allocated to visual information. Thus, we propose a framework to locate action-relevant image regions and enhance the LVLM's attention to those regions. Specifically, we define the Action-Relation Sensitivity (ARS) score to identify attention heads that are most sensitive to action-relation changes, thereby localizing action-relevant image regions that contain key visual cues. Then, we propose the Relation-aware Visual Enhancement (RVE) method to enhance the LVLM's attention to these action-relevant image regions. Extensive experiments demonstrate that, compared to existing baselines, our method achieves superior performance in mitigating action-relation hallucinations with negligible additional inference cost. Furthermore, it effectively generalizes to spatial-relation hallucinations and object hallucinations.
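The abstract does not give the ARS formula or the exact enhancement rule, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes a head's sensitivity can be approximated by how much its attention over image tokens shifts when the action relation in the input is changed, that relevant regions are the image tokens those heads attend to most, and that enhancement is an additive boost to pre-softmax attention logits. All function names, the `keep_ratio` and `alpha` parameters, and the toy numbers are hypothetical and not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sensitivity_scores(attn_orig, attn_pert):
    """Assumed stand-in for an ARS-like score: per-head L1 distance between
    attention over image tokens for the original input and for an input
    whose action relation was changed."""
    return [sum(abs(a - b) for a, b in zip(head_o, head_p))
            for head_o, head_p in zip(attn_orig, attn_pert)]

def top_k_heads(scores, k):
    """Indices of the k heads with the largest sensitivity scores."""
    return sorted(range(len(scores)), key=lambda h: scores[h], reverse=True)[:k]

def relevant_regions(attn_orig, heads, keep_ratio=0.25):
    """Average the selected heads' attention and keep the top fraction of
    image tokens as the (assumed) action-relevant regions."""
    n_tokens = len(attn_orig[0])
    mean = [sum(attn_orig[h][i] for h in heads) / len(heads)
            for i in range(n_tokens)]
    n_keep = max(1, int(round(n_tokens * keep_ratio)))
    ranked = sorted(range(n_tokens), key=lambda i: mean[i], reverse=True)
    return set(ranked[:n_keep])

def enhance(logits, regions, alpha=1.0):
    """Add alpha to the pre-softmax logits of the selected image tokens,
    then renormalize so the result is still a distribution."""
    boosted = [x + alpha if i in regions else x for i, x in enumerate(logits)]
    return softmax(boosted)

# Toy example: 3 heads, 4 image tokens (numbers are illustrative only).
attn_orig = [[0.25, 0.25, 0.25, 0.25],
             [0.10, 0.60, 0.20, 0.10],
             [0.40, 0.30, 0.20, 0.10]]
attn_pert = [[0.25, 0.25, 0.25, 0.25],
             [0.50, 0.10, 0.20, 0.20],
             [0.40, 0.30, 0.10, 0.20]]

scores = sensitivity_scores(attn_orig, attn_pert)
heads = top_k_heads(scores, k=1)
regions = relevant_regions(attn_orig, heads)
enhanced = enhance([0.0, 0.0, 0.0, 0.0], regions, alpha=1.0)
```

In this toy setup only the second head reacts strongly to the perturbed input, so it alone determines which image token receives the boost. A boost applied to logits before the softmax keeps the attention weights normalized, which is one simple way to raise attention on selected regions without retraining. The paper's actual mechanism may differ.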
Problem

Research questions and friction points this paper is trying to address.

action-relation hallucinations
vision-language models
visual hallucination
relation hallucination
LVLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Relation-aware Visual Enhancement
Action-Relation Hallucination
Attention Localization
Large Vision-Language Models
Visual Hallucination Mitigation