🤖 AI Summary
This work addresses the limitations of existing deepfake detection models, whose natural language explanations often lack visual grounding and are evaluated primarily on classification accuracy while neglecting reasoning faithfulness. To this end, we propose DeepfakeJudge, a novel framework that introduces reasoning fidelity as a quantifiable evaluation dimension. We construct an out-of-distribution deepfake benchmark with a human-annotated visual reasoning subset and design a bootstrapped generate-and-evaluate loop that operates without ground-truth reasoning labels, enabling scalable supervision and assessment. Leveraging a multimodal large language model as the evaluator, our approach supports both pointwise and pairwise evaluation. Experiments show that DeepfakeJudge achieves 96.2% accuracy on a meta-evaluation benchmark—significantly outperforming a baseline 30 times larger—and exhibits strong alignment with human judgments (98.9% pairwise agreement), with user studies indicating that 70% of participants prefer its generated explanations.
📝 Abstract
Deepfake detection models often generate natural-language explanations, yet their reasoning is frequently ungrounded in visual evidence, limiting reliability. Existing evaluations measure classification accuracy but overlook reasoning fidelity. We propose DeepfakeJudge, a framework for scalable reasoning supervision and evaluation that integrates an out-of-distribution benchmark containing recent generative and editing forgeries, a human-annotated subset with visual reasoning labels, and a suite of evaluation models that assess reasoning rationales without requiring explicit ground-truth rationales. The Judge is optimized through a bootstrapped generator-evaluator process that scales human feedback into structured reasoning supervision and supports both pointwise and pairwise evaluation. On the proposed meta-evaluation benchmark, our reasoning-bootstrapped model achieves an accuracy of 96.2\%, outperforming baselines \texttt{30x} its size. The reasoning judge attains high correlation with human ratings and 98.9\% pairwise agreement on the human-annotated meta-evaluation subset. These results establish reasoning fidelity as a quantifiable dimension of deepfake detection and demonstrate scalable supervision for interpretable deepfake reasoning. In our user study, participants preferred the explanations generated by our framework 70\% of the time, in terms of faithfulness, groundedness, and usefulness, over those produced by other models and datasets. All of our datasets, models, and codebase are \href{https://github.com/KjAeRsTuIsK/DeepfakeJudge}{open-sourced}.
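As a concrete illustration of the pairwise-evaluation setting, the judge's agreement with human annotators can be computed as the fraction of explanation pairs where the judge and the human pick the same preferred rationale. The sketch below is a minimal illustration under assumed names and data layout (`pairwise_agreement`, preference labels `"A"`/`"B"`); it is not the paper's implementation.

```python
def pairwise_agreement(judge_prefs, human_prefs):
    """Fraction of explanation pairs where the judge's preferred
    rationale ("A" or "B") matches the human annotator's choice."""
    assert len(judge_prefs) == len(human_prefs) and judge_prefs
    matches = sum(j == h for j, h in zip(judge_prefs, human_prefs))
    return matches / len(judge_prefs)

# Toy example: the judge agrees with humans on 9 of 10 pairs.
judge = ["A", "B", "A", "A", "B", "A", "B", "B", "A", "A"]
human = ["A", "B", "A", "B", "B", "A", "B", "B", "A", "A"]
print(pairwise_agreement(judge, human))  # 0.9
```

A score of 0.989, as reported on the meta-evaluation subset, would mean the judge and the human annotator chose the same rationale in 98.9\% of annotated pairs.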