🤖 AI Summary
Video Multimodal Large Language Models (VideoMLLMs) suffer from pervasive factual hallucinations in bidirectional video–text generation tasks, yet existing evaluation methods are limited to unidirectional settings and struggle with open-ended outputs. To address this, we propose FIFA, a unified fidelity evaluation framework that supports factual consistency assessment for both text-to-video and video-to-text generation—the first of its kind. FIFA constructs a spatiotemporal semantic dependency graph to model cross-modal factual structure, integrating factual triplet extraction, video question answering for verification, and a tool-augmented Post-Correction mechanism for output refinement. Experiments demonstrate that FIFA achieves strong agreement with human judgments (Spearman’s ρ > 0.85), significantly outperforming baseline methods. Moreover, the Post-Correction step improves the factual accuracy of generated content by an average of 22.3%.
📝 Abstract
Video Multimodal Large Language Models (VideoMLLMs) have achieved remarkable progress in both Video-to-Text and Text-to-Video tasks. However, they often suffer from hallucinations, generating content that contradicts the visual input. Existing evaluation methods are limited to one task (e.g., V2T) and fail to assess hallucinations in open-ended, free-form responses. To address this gap, we propose FIFA, a unified FaIthFulness evAluation framework that extracts comprehensive descriptive facts, models their semantic dependencies via a Spatio-Temporal Semantic Dependency Graph, and verifies them using VideoQA models. We further introduce Post-Correction, a tool-based correction framework that revises hallucinated content. Extensive experiments demonstrate that FIFA aligns more closely with human judgment than existing evaluation methods, and that Post-Correction effectively improves factual consistency in both text and video generation.
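To make the evaluation loop described above concrete, here is a minimal sketch of a FIFA-style pipeline: extract descriptive facts as triplets, verify each against the video with a VideoQA model, and score faithfulness as the fraction of supported facts. All function names, data shapes, and the stubbed extractor/verifier are illustrative assumptions, not the paper's actual API.

```python
# Hedged sketch of a FIFA-style faithfulness score. The extractor and the
# VideoQA verifier are hard-coded stubs; a real system would back both with
# learned models, and the dependency graph would gate which facts get checked.
from dataclasses import dataclass, field


@dataclass
class Fact:
    """One descriptive fact extracted from generated text, as a triplet."""
    subject: str
    relation: str
    obj: str
    depends_on: list = field(default_factory=list)  # dependency-graph edges


def extract_facts(text: str) -> list[Fact]:
    """Stub for descriptive-fact (triplet) extraction from generated text."""
    # A real implementation would use an LLM; we hard-code one example here.
    return [
        Fact("dog", "action", "runs"),
        Fact("dog", "location", "park"),
    ]


def verify_with_videoqa(fact: Fact, video: object) -> bool:
    """Stub for a VideoQA model answering 'Is <subject> <relation> <obj>?'."""
    # Pretend the location fact contradicts the video (a hallucination).
    return fact.relation != "location"


def fifa_score(text: str, video: object) -> float:
    """Faithfulness = fraction of extracted facts supported by the video."""
    facts = extract_facts(text)
    verified = [verify_with_videoqa(f, video) for f in facts]
    return sum(verified) / len(verified)
```

A Post-Correction step would then take the facts that failed verification and rewrite (or re-generate) only the corresponding spans of the output, rather than regenerating everything.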