🤖 AI Summary
To address multimodal hallucinations (factually inconsistent video descriptions) in ResNetVLLM, this paper proposes a two-stage solution. First, it introduces the first video–text cross-modal faithfulness detection protocol, which leverages an enhanced Lynx semantic alignment model for fine-grained hallucination identification. Second, it constructs a dynamic, inference-time knowledge base integrated with retrieval-augmented generation (RAG) to enable real-time hallucination mitigation. Unlike conventional static fine-tuning paradigms, this approach unifies detection and correction in a single adaptive framework. Evaluated on ActivityNet-QA, the method improves accuracy from 54.8% to 65.3%, demonstrating substantial gains in factual consistency and generation reliability for video-language understanding.
📝 Abstract
Large Language Models (LLMs) have transformed natural language processing (NLP) tasks, but they suffer from hallucination, generating plausible yet factually incorrect content. This issue extends to Video-Language Models (VideoLLMs), where textual descriptions may inaccurately represent visual content, resulting in multimodal hallucinations. In this paper, we address hallucination in ResNetVLLM, a video-language model combining ResNet visual encoders with LLMs. We introduce a two-step protocol: (1) a faithfulness detection strategy that uses a modified Lynx model to assess semantic alignment between generated captions and ground-truth video references, and (2) a hallucination mitigation strategy using Retrieval-Augmented Generation (RAG) with an ad hoc knowledge base dynamically constructed during inference. Our enhanced model, ResNetVLLM-2, reduces multimodal hallucinations by cross-verifying generated content against external knowledge, improving factual consistency. Evaluation on the ActivityNet-QA benchmark demonstrates a substantial accuracy increase from 54.8% to 65.3%, highlighting the effectiveness of our hallucination detection and mitigation strategies in enhancing video-language model reliability.
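The two-step protocol can be illustrated with a minimal sketch. This is a hypothetical toy implementation, not the paper's actual code: the Lynx semantic alignment model is stood in for by a crude token-overlap score, and the dynamically built knowledge base is a plain dict; all function and variable names here are illustrative assumptions.

```python
def faithfulness_score(caption: str, reference: str) -> float:
    """Crude stand-in for Lynx-style semantic alignment (token overlap).
    The real protocol uses a learned alignment model, not this heuristic."""
    cap = set(caption.lower().split())
    ref = set(reference.lower().split())
    if not cap:
        return 0.0
    return len(cap & ref) / len(cap)


def mitigate_with_rag(caption: str, knowledge_base: dict,
                      video_id: str, threshold: float = 0.5) -> str:
    """Step 1: score the caption against retrieved evidence.
    Step 2: if the score falls below the threshold (a hallucination is
    detected), fall back to the grounded description, RAG-style."""
    evidence = knowledge_base.get(video_id, "")
    if faithfulness_score(caption, evidence) >= threshold:
        return caption            # caption is consistent with evidence
    return evidence or caption    # otherwise replace with grounded text


# Toy ad hoc knowledge base, as if built during inference.
kb = {"vid_001": "a man rides a bicycle down a hill"}

print(mitigate_with_rag("a man rides a bicycle down a hill fast", kb, "vid_001"))
print(mitigate_with_rag("a cat sleeps on a sofa", kb, "vid_001"))
```

In the first call the caption aligns well with the retrieved evidence and is kept; in the second, the low alignment score triggers mitigation and the grounded description is returned instead.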