FakeHunter: Multimodal Step-by-Step Reasoning for Explainable Video Forensics

📅 2025-08-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the lack of interpretability in video forgery detection, this paper proposes FakeHunter, a multimodal forensic framework that integrates memory-guided retrieval, Observation-Thought-Action (OTA) chain-of-thought reasoning, and tool-augmented verification. It encodes visual content with CLIP and audio with CLAP, retrieves semantically similar real exemplars from a FAISS-indexed memory bank, and uses Qwen2.5-Omni-7B for cross-modal reasoning, invoking fine-grained image and audio analysis tools when confidence is low. On the authors' X-AVFake benchmark, FakeHunter reaches 34.75% accuracy, 16.87 percentage points above vanilla Qwen2.5-Omni-7B and 25.56 percentage points above MiniCPM-2.6; tool-based inspection raises accuracy on low-confidence cases to 46.50%. A 10-minute clip is processed in 8 minutes on a single NVIDIA A800 (0.8x real-time). The core contribution is a traceable, verifiable stepwise reasoning pipeline that makes detection decisions more transparent.

📝 Abstract

FakeHunter is a multimodal deepfake detection framework that combines memory-guided retrieval, chain-of-thought (Observation-Thought-Action) reasoning, and tool-augmented verification to provide accurate and interpretable video forensics. FakeHunter encodes visual content using CLIP and audio using CLAP, generating joint audio-visual embeddings that retrieve semantically similar real exemplars from a FAISS-indexed memory bank for contextual grounding. Guided by the retrieved context, the system iteratively reasons over evidence to localize manipulations and explain them. When confidence is low, it automatically invokes specialized tools, such as zoom-in image forensics or mel-spectrogram inspection, for fine-grained verification. Built on Qwen2.5-Omni-7B, FakeHunter produces structured JSON verdicts that specify what was modified, where it occurs, and why it is judged fake. We also introduce X-AVFake, a benchmark comprising 5.7k+ manipulated and real videos (950+ min) annotated with manipulation type, region/entity, violated reasoning category, and free-form justification. On X-AVFake, FakeHunter achieves an accuracy of 34.75%, outperforming the vanilla Qwen2.5-Omni-7B by 16.87 percentage points and MiniCPM-2.6 by 25.56 percentage points. Ablation studies reveal that memory retrieval contributes a 7.75 percentage point gain, and tool-based inspection raises accuracy on low-confidence cases to 46.50%. Despite its multi-stage design, the pipeline processes a 10-minute clip in 8 minutes on a single NVIDIA A800 (0.8x real-time) or 2 minutes on four GPUs (0.2x), demonstrating practical deployability.
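
The memory-guided retrieval step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: short hand-written vectors stand in for CLIP and CLAP embeddings, a brute-force inner-product search stands in for the FAISS index, and the fusion rule (normalize each modality, concatenate, renormalize) is an assumption, since the abstract does not say how the joint embedding is formed. The names `fuse`, `retrieve`, and `memory_bank` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def l2_normalize(v: Sequence[float]) -> Vector:
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def fuse(visual: Sequence[float], audio: Sequence[float]) -> Vector:
    """Assumed fusion rule: normalize each modality, concatenate, renormalize."""
    return l2_normalize(l2_normalize(visual) + l2_normalize(audio))


def retrieve(query: Vector,
             memory: List[Tuple[str, Vector]],
             k: int = 2) -> List[Tuple[str, float]]:
    """Return the k entries with the highest inner product with the query.

    All vectors are unit length, so the inner product equals cosine similarity.
    A real system would delegate this search to a vector index such as FAISS.
    """
    scored = [(name, sum(q * m for q, m in zip(query, vec)))
              for name, vec in memory]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Toy memory bank of "real" exemplars (illustrative vectors, not model outputs).
memory_bank = [
    ("real_interview", fuse([1.0, 0.0], [1.0, 0.0])),
    ("real_concert", fuse([0.0, 1.0], [0.0, 1.0])),
    ("real_lecture", fuse([1.0, 0.2], [0.9, 0.1])),
]

query = fuse([1.0, 0.1], [1.0, 0.0])
neighbors = retrieve(query, memory_bank, k=2)
for name, score in neighbors:
    print(f"{name}: {score:.4f}")
```

The retrieved exemplars would then be passed to the reasoning model as context; how that context is formatted is not specified in the abstract.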

Problem

Research questions and friction points this paper is trying to address.

Detecting manipulated videos from joint audio-visual evidence

Providing explainable forensics with localization and justification

Automating tool-augmented verification for low-confidence cases
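
The abstract states that verdicts are structured JSON covering what was modified, where it occurs, and why it is judged fake. A minimal sketch of such a record follows; the field names (`label`, `what`, `where`, `why`, `confidence`) and the example values are assumptions made for illustration, not the schema or output used by FakeHunter.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Verdict:
    """Hypothetical verdict record; field names are illustrative assumptions."""
    label: str         # "fake" or "real"
    what: str          # what was modified
    where: str         # region or entity affected
    why: str           # free-form justification
    confidence: float  # model confidence in [0, 1]


def to_json(verdict: Verdict) -> str:
    """Serialize a verdict to a JSON string with stable key order."""
    return json.dumps(asdict(verdict), sort_keys=True)


# Invented example values, shown only to illustrate the record shape.
example = Verdict(
    label="fake",
    what="speech replaced",
    where="audio track, speaker on the left",
    why="lip motion does not match the spoken phonemes",
    confidence=0.42,
)
payload = to_json(example)
print(payload)
```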

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal framework combining memory-guided retrieval and reasoning

Uses CLIP and CLAP for audio-visual embeddings and FAISS retrieval

Invokes specialized tools for fine-grained verification when needed
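
The confidence-gated tool invocation listed above can be sketched as follows. This is a sketch under stated assumptions: the threshold value, the modality keys, and the stub functions `zoom_in_image_check` and `mel_spectrogram_check` are hypothetical placeholders for the zoom-in image forensics and mel-spectrogram inspection tools named in the abstract, which does not give the actual threshold or tool interface.

```python
from typing import Callable, Dict, List

# Hypothetical cutoff; the source does not report the value used.
CONFIDENCE_THRESHOLD = 0.6


def zoom_in_image_check(clip_id: str) -> str:
    """Stub standing in for a zoom-in image forensics tool."""
    return f"zoom-in image forensics on {clip_id}"


def mel_spectrogram_check(clip_id: str) -> str:
    """Stub standing in for a mel-spectrogram inspection tool."""
    return f"mel-spectrogram inspection on {clip_id}"


# Map each suspected modality to the tool that inspects it.
TOOLS: Dict[str, Callable[[str], str]] = {
    "visual": zoom_in_image_check,
    "audio": mel_spectrogram_check,
}


def verify(clip_id: str, confidence: float, suspected: List[str]) -> List[str]:
    """Run the matching tools only when confidence is below the threshold."""
    if confidence >= CONFIDENCE_THRESHOLD:
        return []  # confident verdicts skip tool-based inspection
    return [TOOLS[m](clip_id) for m in suspected if m in TOOLS]


print(verify("clip_a", 0.9, ["visual"]))
print(verify("clip_b", 0.3, ["audio", "visual"]))
```

Gating on confidence keeps the extra tool calls limited to uncertain cases, which is consistent with the abstract's description of tools being invoked only when confidence is low.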
