🤖 AI Summary
To address multimodal hallucinations (factually inconsistent video descriptions) in ResNetVLLM, this paper proposes a two-stage solution. First, it introduces the first video–text cross-modal faithfulness detection protocol, which leverages an enhanced Lynx semantic alignment model for fine-grained hallucination identification. Second, it constructs a dynamic, inference-time knowledge base integrated with retrieval-augmented generation (RAG) to enable real-time hallucination mitigation. Unlike conventional static fine-tuning paradigms, this approach unifies detection and correction in a single adaptive framework. Evaluated on ActivityNet-QA, the method improves accuracy from 54.8% to 65.3%, demonstrating substantial gains in factual consistency and generation reliability for video-language understanding.
📝 Abstract
Large Language Models (LLMs) have transformed natural language processing (NLP) tasks, but they suffer from hallucination, generating plausible yet factually incorrect content. This issue extends to Video-Language Models (VideoLLMs), where textual descriptions may inaccurately represent visual content, resulting in multimodal hallucinations. In this paper, we address hallucination in ResNetVLLM, a video-language model combining ResNet visual encoders with LLMs. We introduce a two-step protocol: (1) a faithfulness detection strategy that uses a modified Lynx model to assess semantic alignment between generated captions and ground-truth video references, and (2) a hallucination mitigation strategy using Retrieval-Augmented Generation (RAG) with an ad hoc knowledge base dynamically constructed during inference. Our enhanced model, ResNetVLLM-2, reduces multimodal hallucinations by cross-verifying generated content against external knowledge, improving factual consistency. Evaluation on the ActivityNet-QA benchmark demonstrates a substantial accuracy increase from 54.8% to 65.3%, highlighting the effectiveness of our hallucination detection and mitigation strategies in enhancing video-language model reliability.
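The two-step protocol can be illustrated with a minimal sketch. This is a hypothetical toy implementation, not the paper's actual code: the Lynx semantic alignment model is stood in for by a crude token-overlap score, and the dynamically built knowledge base is a plain dict; all function and variable names here are illustrative assumptions.

```python
def faithfulness_score(caption: str, reference: str) -> float:
    """Crude stand-in for Lynx-style semantic alignment (token overlap).
    The real protocol uses a learned alignment model, not this heuristic."""
    cap = set(caption.lower().split())
    ref = set(reference.lower().split())
    if not cap:
        return 0.0
    return len(cap & ref) / len(cap)


def mitigate_with_rag(caption: str, knowledge_base: dict,
                      video_id: str, threshold: float = 0.5) -> str:
    """Step 1: score the caption against retrieved evidence.
    Step 2: if the score falls below the threshold (a hallucination is
    detected), fall back to the grounded description, RAG-style."""
    evidence = knowledge_base.get(video_id, "")
    if faithfulness_score(caption, evidence) >= threshold:
        return caption            # caption is consistent with evidence
    return evidence or caption    # otherwise replace with grounded text


# Toy ad hoc knowledge base, as if built during inference.
kb = {"vid_001": "a man rides a bicycle down a hill"}

print(mitigate_with_rag("a man rides a bicycle down a hill fast", kb, "vid_001"))
print(mitigate_with_rag("a cat sleeps on a sofa", kb, "vid_001"))
```

In the first call the caption aligns well with the retrieved evidence and is kept; in the second, the low alignment score triggers mitigation and the grounded description is returned instead.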