🤖 AI Summary
To address the bottlenecks of data scarcity and inadequate reasoning capability in multimodal video misinformation detection, this paper introduces FakeVV, the first large-scale, diverse benchmark comprising over 100,000 video-text pairs. We further propose Fact-R1, a three-stage collaborative reinforcement learning framework that combines chain-of-thought (CoT) instruction tuning, direct preference optimization (DPO), and group relative policy optimization (GRPO). Fact-R1 leverages verifiable reward functions and multimodal alignment modeling to enhance both detection accuracy and interpretability. Experimental results demonstrate that Fact-R1 achieves a 12.7% absolute improvement over state-of-the-art methods on FakeVV. Moreover, it enables fine-grained attribution and human-verifiable reasoning traces, advancing transparency and trustworthiness in multimodal misinformation detection.
📝 Abstract
The rapid spread of multimodal misinformation on social media has raised growing concerns, yet research on video misinformation detection remains limited due to the lack of large-scale, diverse datasets. Existing methods often overfit to rigid templates and lack deep reasoning over deceptive content. To address these challenges, we introduce FakeVV, a large-scale benchmark comprising over 100,000 video-text pairs with fine-grained, interpretable annotations. We further propose Fact-R1, a novel framework that integrates deep reasoning with collaborative rule-based reinforcement learning. Fact-R1 is trained through a three-stage process: (1) long Chain-of-Thought (CoT) instruction tuning on misinformation data, (2) preference alignment via Direct Preference Optimization (DPO), and (3) Group Relative Policy Optimization (GRPO) using a novel verifiable reward function. This enables Fact-R1 to exhibit emergent reasoning behaviors comparable to those observed in advanced text-based reinforcement learning systems, but in the more complex multimodal misinformation setting. Our work establishes a new paradigm for misinformation detection, bridging large-scale video understanding, reasoning-guided alignment, and interpretable verification.
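To make the third training stage more concrete, the following is a minimal sketch of how a rule-based reward and GRPO-style group-relative advantages can fit together. It is not the paper's reward function: the `<think>`/`<answer>` response format, the reward weights (0.5 for format, 1.0 for a correct label), and the function names `verifiable_reward` and `group_relative_advantages` are all assumptions made for illustration. The only part taken from the general GRPO recipe is that each sampled response's reward is normalized by the mean and standard deviation of its sampling group.

```python
import re
from statistics import mean, pstdev

# Hypothetical response format (an assumption, not the paper's template):
# <think>reasoning</think><answer>real|fake</answer>
_PATTERN = re.compile(
    r"\s*<think>.+?</think>\s*<answer>(real|fake)</answer>\s*",
    flags=re.DOTALL,
)


def verifiable_reward(response: str, gold_label: str) -> float:
    """Illustrative rule-based reward with assumed weights.

    Returns 0.0 if the response ignores the format, 0.5 if it follows the
    format but predicts the wrong label, and 1.5 if it follows the format
    and predicts the gold label.
    """
    match = _PATTERN.fullmatch(response)
    if match is None:
        return 0.0
    return 0.5 + (1.0 if match.group(1) == gold_label else 0.0)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampling group, as in GRPO.

    Each advantage is (reward - group mean) / (group std + eps). The
    population standard deviation is used here as a simplifying choice.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Usage: four sampled responses for one video-text pair whose gold label is "fake".
responses = [
    "<think>The caption names an event the footage does not show.</think><answer>fake</answer>",
    "<think>The caption and video appear consistent.</think><answer>real</answer>",
    "fake",  # no reasoning trace, so the format check fails
    "<think>The on-screen date contradicts the caption.</think><answer>fake</answer>",
]
rewards = [verifiable_reward(r, "fake") for r in responses]
advantages = group_relative_advantages(rewards)
```

Because the advantages are centered on the group mean, responses that are both well-formed and correct are pushed up relative to their siblings, while malformed ones are pushed down, without requiring a learned value model.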