🤖 AI Summary
This work addresses the pervasive hallucination problem in text-to-video (T2V) generation by large multimodal models (LMMs). We first systematically define and annotate five canonical hallucination categories, then construct ViBe—the first open-source, human-verified large-scale T2V hallucination benchmark—comprising 3,782 video samples generated from 837 COCO prompts. To detect hallucinations, we propose a video embedding framework combining TimeSformer and CNN features, coupled with ensemble classification, and establish a standardized human annotation protocol. Comprehensive evaluation across ten state-of-the-art T2V models reveals that the best-performing baseline achieves only 0.345 accuracy, underscoring the significant challenges in automated hallucination detection. The ViBe dataset and evaluation code are publicly released, providing critical infrastructure and a new standard for quantitatively assessing reliability and improving robustness of T2V models.
📝 Abstract
Recent advances in Large Multimodal Models (LMMs) have expanded their capabilities to video understanding, with Text-to-Video (T2V) models excelling at generating videos from textual prompts. However, these models still frequently produce hallucinated content that reveals AI-generated inconsistencies. We introduce ViBe (https://vibe-t2v-bench.github.io/): a large-scale dataset of hallucinated videos from open-source T2V models. We identify five major hallucination types: Vanishing Subject, Omission Error, Numeric Variability, Subject Dysmorphia, and Visual Incongruity. Using ten T2V models, we generated and manually annotated 3,782 videos from 837 diverse MS COCO captions. Our proposed benchmark includes a dataset of hallucinated videos and a classification framework using video embeddings. ViBe serves as a critical resource for evaluating T2V reliability and advancing hallucination detection. We establish classification as a baseline, with the TimeSformer + CNN ensemble achieving the best performance (0.345 accuracy, 0.342 F1 score). While the initial baselines achieve only modest accuracy, this highlights the difficulty of automated hallucination detection and the need for improved methods. Our research aims to drive the development of more robust T2V models and to evaluate their outputs based on user preferences.
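The embedding-and-classification baseline described above can be sketched schematically. The toy code below is a minimal illustration, not the paper's implementation: random vectors stand in for the TimeSformer and CNN video embeddings, and a majority vote over nearest-centroid classifiers stands in for the ensemble. All function names and the synthetic data are hypothetical.

```python
import numpy as np

# The five hallucination categories defined by ViBe.
CLASSES = ["Vanishing Subject", "Omission Error", "Numeric Variability",
           "Subject Dysmorphia", "Visual Incongruity"]

rng = np.random.default_rng(0)

def fake_embedding(label_idx, dim, rng):
    """Hypothetical stand-in for a real feature extractor: clusters
    videos of the same class around a class-specific mean."""
    return rng.normal(loc=label_idx, scale=0.5, size=dim)

def fit_centroids(X, y, n_classes):
    # One centroid per hallucination class.
    return np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])

def predict_centroid(X, centroids):
    # Assign each sample to its nearest class centroid.
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

def majority_vote(preds):
    # Ensemble by per-sample majority vote across classifiers.
    preds = np.stack(preds)  # shape: (n_models, n_samples)
    out = []
    for col in preds.T:
        vals, counts = np.unique(col, return_counts=True)
        out.append(vals[counts.argmax()])
    return np.array(out)

# Toy dataset: 20 videos per class, two synthetic embedding streams.
n_per_class = 20
y = np.repeat(np.arange(len(CLASSES)), n_per_class)
X_time = np.stack([fake_embedding(c, 64, rng) for c in y])  # "TimeSformer"
X_cnn = np.stack([fake_embedding(c, 32, rng) for c in y])   # "CNN"

# One classifier per embedding stream plus one on the concatenation,
# ensembled by majority vote.
preds = []
for X in (X_time, X_cnn, np.concatenate([X_time, X_cnn], axis=1)):
    centroids = fit_centroids(X, y, len(CLASSES))
    preds.append(predict_centroid(X, centroids))

y_hat = majority_vote(preds)
accuracy = (y_hat == y).mean()
print(f"ensemble accuracy on toy data: {accuracy:.3f}")
```

On this well-separated synthetic data the toy ensemble scores near-perfectly; the paper's 0.345 accuracy on real generated videos underscores how much harder the actual five-way hallucination classification task is.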