🤖 AI Summary
To address the lack of interpretability in video forgery detection, this paper proposes FakeHunter, presented as the first multimodal forensic framework to integrate memory-guided retrieval, Observation-Thought-Action chain-of-thought reasoning, and tool-augmented verification. Methodologically, it encodes visual content with CLIP and audio with CLAP, retrieves similar real samples from a FAISS-indexed memory bank, and uses Qwen2.5-Omni-7B for cross-modal reasoning, invoking fine-grained image and audio analysis tools for follow-up verification when confidence is low. On the accompanying X-AVFake benchmark, FakeHunter achieves 34.75% accuracy, outperforming the vanilla Qwen2.5-Omni-7B by 16.87 percentage points and MiniCPM-2.6 by 25.56 percentage points; tool-based verification raises accuracy on low-confidence samples to 46.50%. The pipeline processes a 10-minute clip in 8 minutes on a single NVIDIA A800. The core contribution is a traceable, verifiable, and extensible stepwise multimodal reasoning paradigm intended to make detection more transparent and robust.
📝 Abstract
FakeHunter is a multimodal deepfake detection framework that combines memory-guided retrieval, chain-of-thought (Observation-Thought-Action) reasoning, and tool-augmented verification to provide accurate and interpretable video forensics. FakeHunter encodes visual content using CLIP and audio using CLAP, generating joint audio-visual embeddings that retrieve semantically similar real exemplars from a FAISS-indexed memory bank for contextual grounding. Guided by the retrieved context, the system iteratively reasons over evidence to localize manipulations and explain them. When confidence is low, it automatically invokes specialized tools, such as zoom-in image forensics or mel-spectrogram inspection, for fine-grained verification. Built on Qwen2.5-Omni-7B, FakeHunter produces structured JSON verdicts that specify what was modified, where it occurs, and why it is judged fake. We also introduce X-AVFake, a benchmark comprising 5.7k+ manipulated and real videos (950+ min) annotated with manipulation type, region/entity, violated reasoning category, and free-form justification. On X-AVFake, FakeHunter achieves an accuracy of 34.75%, outperforming the vanilla Qwen2.5-Omni-7B by 16.87 percentage points and MiniCPM-2.6 by 25.56 percentage points. Ablation studies reveal that memory retrieval contributes a 7.75 percentage point gain, and tool-based inspection raises accuracy on low-confidence cases to 46.50%. Despite its multi-stage design, the pipeline processes a 10-minute clip in 8 minutes on a single NVIDIA A800 (0.8x the clip duration) or 2 minutes on four GPUs (0.2x), demonstrating practical deployability.
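The retrieve-then-verify flow described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. Toy vectors stand in for CLIP and CLAP embeddings, a brute-force inner-product search stands in for the FAISS index, and fusion by concatenation, the 0.5 confidence threshold, the `confidence` and `run_tools` fields, and every function name are assumptions made for illustration only.

```python
import json
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def joint_embedding(visual, audio):
    """Assumed fusion: concatenate the normalized visual and audio vectors."""
    return l2_normalize(l2_normalize(visual) + l2_normalize(audio))


def retrieve_top_k(query, memory_bank, k=2):
    """Brute-force inner-product search over real exemplars.

    Stand-in for a FAISS index lookup; returns (name, score) pairs,
    highest score first.
    """
    scored = [
        (sum(q * m for q, m in zip(query, emb)), name)
        for name, emb in memory_bank.items()
    ]
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored[:k]]


def needs_tool_check(confidence, threshold=0.5):
    """Hypothetical gate: request fine-grained tools when confidence is low."""
    return confidence < threshold


def build_verdict(what, where, why, confidence, threshold=0.5):
    """Serialize a what/where/why verdict as JSON (extra fields are assumed)."""
    return json.dumps({
        "what": what,
        "where": where,
        "why": why,
        "confidence": confidence,
        "run_tools": needs_tool_check(confidence, threshold),
    })


def real_time_factor(processing_minutes, clip_minutes):
    """Processing time divided by clip duration (below 1.0 is faster than playback)."""
    return processing_minutes / clip_minutes


# Toy memory bank of "real" exemplars (hypothetical names and vectors).
memory_bank = {
    "real_clip_a": joint_embedding([1.0, 0.0], [0.0, 1.0]),
    "real_clip_b": joint_embedding([0.0, 1.0], [1.0, 0.0]),
}
query = joint_embedding([0.9, 0.1], [0.1, 0.9])
neighbors = retrieve_top_k(query, memory_bank, k=1)

verdict = build_verdict(
    what="face swap",
    where="speaker's face, 00:12-00:18",
    why="lip motion inconsistent with audio",
    confidence=0.3,
)

single_gpu_factor = real_time_factor(8, 10)  # 10-minute clip in 8 minutes
four_gpu_factor = real_time_factor(2, 10)    # 10-minute clip in 2 minutes
```

In this toy run the query is closest to `real_clip_a`, and the confidence of 0.3 falls below the assumed threshold, so the verdict sets `run_tools` to true. The last two lines reproduce the 0.8x and 0.2x throughput figures quoted in the abstract.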