🤖 AI Summary
Current evaluation metrics for synthetic videos (e.g., VideoScore) lack sensitivity to violations of commonsense and physical laws, and suffer from poor interpretability. Method: We introduce VideoHallu, the first multimodal hallucination benchmark for synthetic videos, featuring expert-designed, cross-category question-answering tasks covering commonsense and physical causality. We further propose a fine-grained alignment method based on Group Relative Policy Optimization (GRPO), incorporating counterexample-augmented training to enhance reasoning robustness. Contribution/Results: VideoHallu systematically exposes severe hallucinations in leading multimodal large language models (MLLMs), including GPT-4o, Gemini 2.5 Pro, Qwen-2.5-VL, and Video-R1, when they interpret synthetic videos. Our GRPO-based method achieves a +12.7% average accuracy gain on VideoHallu, rising to +19.3% on physical causality tasks. We publicly release the dataset and evaluation code to advance standardized, trustworthy assessment of synthetic video understanding.
📄 Abstract
Synthetic video generation with foundation models has gained attention for its realism and wide applications. While these models produce high-quality frames, they often fail to respect common sense and physical laws, resulting in abnormal content. Existing metrics like VideoScore emphasize general quality but ignore such violations and lack interpretability. A more insightful approach is to use multimodal large language models (MLLMs) as interpretable evaluators, as seen in FactScore. Yet MLLMs' ability to detect abnormalities in synthetic videos remains underexplored. To address this, we introduce VideoHallu, a benchmark featuring synthetic videos from models like Veo2, Sora, and Kling, paired with expert-designed QA tasks across various categories that are solvable via human-level reasoning. We assess several SoTA MLLMs, including GPT-4o, Gemini 2.5 Pro, Qwen-2.5-VL, and newer models like Video-R1 and VideoChat-R1. Despite strong real-world performance on MVBench and MovieChat, these models still hallucinate on basic commonsense and physics tasks in synthetic settings, underscoring the challenge of hallucination. We further fine-tune SoTA MLLMs using Group Relative Policy Optimization (GRPO) on real and synthetic commonsense/physics data. Results show notable accuracy gains, especially with counterexample integration, advancing MLLMs' reasoning capabilities. Our data is available at https://github.com/zli12321/VideoHallu.
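The GRPO fine-tuning mentioned above replaces a learned value critic with a group-relative advantage: for each prompt, a group of responses is sampled, each is scored, and every response's advantage is its reward normalized by the group's mean and standard deviation. A minimal sketch of that normalization step, assuming a simple 0/1 correctness reward (the function name and reward scheme are illustrative, not the paper's code):

```python
# Illustrative sketch of GRPO's group-relative advantage computation.
# Assumption: each QA response in a sampled group receives a scalar reward
# (e.g., 1.0 if the answer is judged correct, 0.0 otherwise).
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group's mean and std,
    so no separate value critic is needed."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four sampled answers to one question, two judged correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses rewarded above the group mean receive positive advantages and are reinforced; those below are suppressed, which is what lets counterexample-augmented groups sharpen the policy's reasoning.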