Fact-R1: Towards Explainable Video Misinformation Detection with Deep Reasoning

📅 2025-05-22
🤖 AI Summary
To address data scarcity and limited reasoning capability in multimodal video misinformation detection, this paper introduces FakeVV, described as the first large-scale, diverse benchmark for the task, comprising over 100,000 video-text pairs. It also proposes Fact-R1, a three-stage training framework that combines long chain-of-thought (CoT) instruction tuning, direct preference optimization (DPO), and group relative policy optimization (GRPO). Fact-R1 uses verifiable reward functions and multimodal alignment modeling to improve both detection accuracy and interpretability. Experiments report a 12.7% absolute improvement over state-of-the-art methods on FakeVV. The model also produces fine-grained attributions and human-verifiable reasoning traces, which improves the transparency and trustworthiness of multimodal misinformation detection.

📝 Abstract
The rapid spread of multimodal misinformation on social media has raised growing concerns, while research on video misinformation detection remains limited due to the lack of large-scale, diverse datasets. Existing methods often overfit to rigid templates and lack deep reasoning over deceptive content. To address these challenges, we introduce FakeVV, a large-scale benchmark comprising over 100,000 video-text pairs with fine-grained, interpretable annotations. We further propose Fact-R1, a novel framework that integrates deep reasoning with collaborative rule-based reinforcement learning. Fact-R1 is trained through a three-stage process: (1) misinformation long Chain-of-Thought (CoT) instruction tuning, (2) preference alignment via Direct Preference Optimization (DPO), and (3) Group Relative Policy Optimization (GRPO) using a novel verifiable reward function. This enables Fact-R1 to exhibit emergent reasoning behaviors comparable to those observed in advanced text-based reinforcement learning systems, but in the more complex multimodal misinformation setting. Our work establishes a new paradigm for misinformation detection, bridging large-scale video understanding, reasoning-guided alignment, and interpretable verification.
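The preference-alignment stage named in the abstract can be illustrated with the standard DPO objective for a single preference pair. This is a minimal sketch of the generic DPO loss, not the authors' implementation: the function name `dpo_loss`, the value of `beta`, and the example log-probabilities are all assumptions made for illustration.

```python
import math


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Generic DPO loss for one (chosen, rejected) response pair.

    Inputs are summed token log-probabilities of each response under the
    trainable policy and under a frozen reference model. `beta` is an
    assumed temperature, not a value taken from the paper.
    """
    # How much more the policy prefers the chosen response than the
    # reference model does, scaled by beta.
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Illustrative numbers only: the policy has shifted probability mass
# toward the chosen reasoning trace relative to the reference model.
loss = dpo_loss(policy_chosen_logp=-4.0, policy_rejected_logp=-9.0,
                ref_chosen_logp=-5.0, ref_rejected_logp=-5.0)
print(f"DPO loss: {loss:.4f}")
```

The loss falls below log 2 when the policy favors the chosen response more strongly than the reference model does, and rises above it in the opposite case.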
Problem

Research questions and friction points this paper is trying to address.

Lack of large-scale datasets for video misinformation detection
Existing methods overfit rigid templates without deep reasoning
Need for interpretable multimodal misinformation detection frameworks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale FakeVV benchmark with interpretable annotations
Fact-R1 integrates deep reasoning and rule-based reinforcement learning
Three-stage training: CoT tuning, DPO alignment, GRPO optimization
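The GRPO stage with a rule-based reward can be sketched as follows. This is a hedged illustration of the group-relative advantage used in GRPO-style training, under stated assumptions: the binary label-match reward `verifiable_reward` is hypothetical and stands in for the paper's reward function, which this page does not specify, and the use of the population standard deviation is one common choice rather than a confirmed detail.

```python
from statistics import mean, pstdev


def verifiable_reward(predicted_label: str, gold_label: str) -> float:
    """Hypothetical rule-based reward: 1.0 if the verdict matches, else 0.0."""
    return 1.0 if predicted_label.strip().lower() == gold_label.strip().lower() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the other samples in its group.

    GRPO samples several responses per prompt and scores each one relative
    to the group mean and standard deviation, so no learned value model
    is needed. `eps` guards against division by zero when all rewards tie.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled verdicts for one video-text pair whose gold label is "fake".
sampled_verdicts = ["fake", "real", "fake", "fake"]
rewards = [verifiable_reward(v, "fake") for v in sampled_verdicts]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that beat the group average receive positive advantages and are reinforced; a group in which every response earns the same reward contributes no learning signal.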
Authors

Fanrui Zhang, University of Science and Technology of China, Shanghai Innovation Institute
Dian Li, Tencent.com (MLLM, video understanding, self-supervised learning, vision-language)
Qiang Zhang, University of Science and Technology of China
Chenjun, Tencent QQ
sinbadliu, Tencent QQ
Junxiong Lin, Fudan University (Computer Vision)
Jiahong Yan, Tencent QQ
Jiawei Liu, University of Science and Technology of China
Zheng-Jun Zha, University of Science and Technology of China