Mitigating Object and Action Hallucinations in Multimodal LLMs via Self-Augmented Contrastive Alignment

📅 2025-12-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Multimodal large language models (MLLMs) suffer from joint object-action hallucination in video captioning, particularly exhibiting severe factual inconsistency in dynamic scenes. To address this, we propose the Self-Enhanced Contrastive Alignment Framework (SCAF), the first method to jointly suppress object and action hallucinations. SCAF introduces a self-hallucination enhancement mechanism to generate high-quality negative samples, integrates trajectory-aware object tracking with relation-guided action modeling, and achieves fine-grained vision-language alignment at the phrase level. The framework unifies self-enhanced contrastive learning, regional object tracking, relation-aware action modeling, and phrase-level visual grounding. Extensive evaluations on multiple hallucination benchmarks demonstrate significant improvements over state-of-the-art methods: object and action fabrication rates are reduced by 23.6% and 19.4%, respectively, substantially enhancing factual accuracy and semantic fidelity of generated captions.

Technology Category

Application Category

📝 Abstract
Recent advancement in multimodal LLMs (MLLMs) has demonstrated their remarkable capability to generate descriptive captions for input videos. However, these models suffer from factual inaccuracies in the generated descriptions, causing severe hallucination issues. While prior works have explored alleviating hallucinations for static images, jointly mitigating visual object and temporal action hallucinations for dynamic videos remains a challenging and unsolved task. To tackle this challenge, we propose a Self-Augmented Contrastive Alignment (SANTA) framework for enabling object and action faithfulness by exempting the spurious correlations and enforcing the emphasis on visual facts. SANTA employs a hallucinative self-augmentation scheme to identify the potential hallucinations that lie in the MLLM and transform the original captions to the contrasted negatives. Furthermore, we develop a tracklet-phrase contrastive alignment to match the regional objects and relation-guided actions with their corresponding visual and temporal phrases. Extensive experiments demonstrate that SANTA outperforms existing methods in alleviating object and action hallucinations, yielding superior performance on the hallucination examination benchmarks.
Problem

Research questions and friction points this paper is trying to address.

Mitigates object and action hallucinations in video captioning MLLMs
Addresses spurious correlations and enforces focus on visual facts
Alleviates factual inaccuracies in dynamic video descriptions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-augmented contrastive alignment for object and action faithfulness
Hallucinative self-augmentation to generate contrasted negative captions
Tracklet-phrase contrastive alignment matches objects and actions with phrases
🔎 Similar Papers
No similar papers found.