VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations for Synthetic Videos

📅 2025-05-02
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Current evaluation metrics for synthetic videos (e.g., VideoScore) lack sensitivity to violations of commonsense and physical laws, and suffer from poor interpretability. Method: We introduce VideoHallu, the first multimodal hallucination benchmark for synthetic videos, featuring expert-designed, cross-category question-answering tasks covering commonsense and physical causality. We further propose a fine-grained alignment method based on Group Relative Policy Optimization (GRPO), incorporating counterexample-augmented training to enhance reasoning robustness. Contribution/Results: VideoHallu systematically exposes severe hallucinations in leading multimodal large language models (MLLMs), including GPT-4o, Gemini 2.5 Pro, Qwen-2.5-VL, and Video-R1, when interpreting synthetic videos. Our GRPO-based method achieves a +12.7% average accuracy gain on VideoHallu, rising to +19.3% on physical causality tasks. We publicly release the dataset and evaluation code to advance standardized, trustworthy assessment of synthetic video understanding.

๐Ÿ“ Abstract
Synthetic video generation with foundation models has gained attention for its realism and wide applications. While these models produce high-quality frames, they often fail to respect common sense and physical laws, resulting in abnormal content. Existing metrics like VideoScore emphasize general quality but ignore such violations and lack interpretability. A more insightful approach is using multi-modal large language models (MLLMs) as interpretable evaluators, as seen in FactScore. Yet, MLLMs' ability to detect abnormalities in synthetic videos remains underexplored. To address this, we introduce VideoHallu, a benchmark featuring synthetic videos from models like Veo2, Sora, and Kling, paired with expert-designed QA tasks solvable via human-level reasoning across various categories. We assess several SoTA MLLMs, including GPT-4o, Gemini-2.5-Pro, Qwen-2.5-VL, and newer models like Video-R1 and VideoChat-R1. Despite strong real-world performance on MVBench and MovieChat, these models still hallucinate on basic commonsense and physics tasks in synthetic settings, underscoring the challenge of hallucination. We further fine-tune SoTA MLLMs using Group Relative Policy Optimization (GRPO) on real and synthetic commonsense/physics data. Results show notable accuracy gains, especially with counterexample integration, advancing MLLMs' reasoning capabilities. Our data is available at https://github.com/zli12321/VideoHallu.
Problem

Research questions and friction points this paper is trying to address.

Evaluating multi-modal hallucinations in synthetic videos
Mitigating commonsense and physics violations in generated videos
Improving MLLMs' reasoning on synthetic video abnormalities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses MLLMs as interpretable evaluators
Introduces VideoHallu benchmark with expert QA
Fine-tunes MLLMs with GRPO on synthetic data
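The GRPO fine-tuning named above relies on a group-relative advantage: several answers are sampled per prompt, scored with a verifiable reward, and each reward is normalized against the group's statistics. A minimal sketch of that normalization step, with hypothetical binary correctness rewards (the paper's actual reward design and counterexample augmentation are not reproduced here):

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its group's mean and std (GRPO-style).

    In GRPO the per-sample advantage is (r - mean(group)) / std(group),
    which removes the need for a learned value baseline.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four sampled answers to one video QA prompt, scored 1.0 if the
# answer is correct and 0.0 otherwise (a hypothetical reward scheme).
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advs)  # correct answers get positive advantage, incorrect negative
```

Because the advantages are centered within each group, their sum is (numerically) zero, so updates only reweight answers relative to their peers rather than shifting the policy toward every sampled answer.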