Perception, Understanding and Reasoning: A Multimodal Benchmark for Video Fake News Detection

πŸ“… 2025-10-28
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing video fake news detection (VFND) benchmarks evaluate only final decision accuracy, lacking fine-grained interpretability analysis of the perceptual, comprehension, and reasoning processes involved. To address this, we propose MVFNDB, a multimodal VFND benchmark comprising 10 fine-grained tasks and 9,730 human-annotated questions, along with a capability taxonomy and a staged evaluation framework. We introduce MVFND-CoT, a novel chain-of-thought reasoning paradigm that jointly models creator intent and raw visual features, enhanced by video–text alignment and multi-feature fusion. Extensive experiments demonstrate that this framework significantly improves both detection performance and model interpretability. MVFNDB is the first systematically designed, high-quality, human-annotated benchmark enabling comprehensive capability assessment and mechanistic analysis of multimodal large language models (MLLMs) in VFND.

πŸ“ Abstract
The advent of multi-modal large language models (MLLMs) has greatly advanced research into video fake news detection (VFND). Traditional video-based FND benchmarks typically focus on the accuracy of the final decision and fail to provide fine-grained assessment of the entire detection process, leaving that process a black box. We therefore introduce MVFNDB (Multi-modal Video Fake News Detection Benchmark), grounded in an empirical analysis that provides the foundation for task definition. The benchmark comprises 10 tasks and is meticulously crafted to probe MLLMs' perception, understanding, and reasoning capacities during detection, featuring 9,730 human-annotated video-related questions built on a carefully constructed VFND capability taxonomy. To validate the impact of combining multiple features on the final result, we design a novel framework named MVFND-CoT, which incorporates reasoning over both creator-added content and original shooting footage. Building upon the benchmark, we conduct an in-depth analysis of the deeper factors influencing accuracy, including video processing strategies and the alignment between video features and model capabilities. We believe this benchmark will lay a solid foundation for future evaluations and advancements of MLLMs in the domain of video fake news detection.
Problem

Research questions and friction points this paper is trying to address.

Evaluating multimodal models' perception, understanding, and reasoning in fake news detection
Providing fine-grained assessment beyond final accuracy for detection processes
Analyzing video feature alignment and processing strategies' impact on detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces MVFNDB benchmark with 10 tasks
Designs MVFND-CoT framework combining multiple features
Analyzes video processing strategies and feature alignment
πŸ”Ž Similar Papers
No similar papers found.
Yakun Cui
The Hong Kong University of Science and Technology
Fushuo Huo
The Hong Kong Polytechnic University
Large Vision Language Model · Multimodal Learning · Trustworthy AI
Weijie Shi
The Hong Kong University of Science and Technology
Juntao Dai
Peking University
Hang Du
Beijing University of Posts and Telecommunications
Zhenghao Zhu
The Hong Kong University of Science and Technology
Sirui Han
The Hong Kong University of Science and Technology
Large Language Model · Interdisciplinary Artificial Intelligence
Yike Guo
The Hong Kong University of Science and Technology