Learning Human-Perceived Fakeness in AI-Generated Videos via Multimodal LLMs

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Prior work on AI-generated video (deepfake) detection overlooks the fine-grained spatiotemporal cues humans rely on for identification, hindering explainable and human-aligned evaluation. Method: We introduce DeeptraceReward, the first spatiotemporally aware deepfake perception benchmark, comprising 3.3K videos and 4.3K fine-grained annotations spanning spatial regions, temporal timestamps, and natural-language explanations. We consolidate these annotations into nine interpretable forgery trace categories and train a 7B multimodal large language model as a reward model that jointly localizes, classifies, and explains spatiotemporal anomalies. Results: Our model outperforms GPT-5 by 34.7% on average across forgery cue recognition, localization, and explanation, and we observe a consistent difficulty gradient from binary real-vs-fake classification to spatiotemporally grounded reasoning. This establishes a foundation for human-centric, interpretable deepfake assessment.

📝 Abstract
Can humans identify AI-generated (fake) videos and provide grounded reasons? While video generation models have advanced rapidly, a critical dimension -- whether humans can detect deepfake traces within a generated video, i.e., spatiotemporally grounded visual artifacts that reveal a video as machine generated -- has been largely overlooked. We introduce DeeptraceReward, the first fine-grained, spatially and temporally aware benchmark that annotates human-perceived fake traces for video generation reward. The dataset comprises 4.3K detailed annotations across 3.3K high-quality generated videos. Each annotation provides a natural-language explanation, pinpoints a bounding-box region containing the perceived trace, and marks precise onset and offset timestamps. We consolidate these annotations into 9 major categories of deepfake traces that lead humans to identify a video as AI-generated, and train multimodal language models (LMs) as reward models to mimic human judgments and localizations. On DeeptraceReward, our 7B reward model outperforms GPT-5 by 34.7% on average across fake clue identification, grounding, and explanation. Interestingly, we observe a consistent difficulty gradient: binary fake vs. real classification is substantially easier than fine-grained deepfake trace detection; within the latter, performance degrades from natural language explanations (easiest), to spatial grounding, to temporal labeling (hardest). By foregrounding human-perceived deepfake traces, DeeptraceReward provides a rigorous testbed and training signal for socially aware and trustworthy video generation.
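For intuition, here is a minimal sketch of what one annotation record might look like, assuming a simple flat schema; the field names and values are illustrative, not the released DeeptraceReward format:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class TraceAnnotation:
    """One human-annotated deepfake trace (hypothetical schema for illustration)."""
    video_id: str                          # generated video the trace appears in
    category: str                          # one of the 9 consolidated trace categories
    explanation: str                       # free-text reason the clip looks AI-generated
    bbox_xyxy: Tuple[int, int, int, int]   # spatial region containing the trace, in pixels
    onset_sec: float                       # timestamp where the artifact first appears
    offset_sec: float                      # timestamp where the artifact ends

# Made-up example mirroring the annotation format described in the abstract.
example = TraceAnnotation(
    video_id="gen_000123",
    category="implausible object motion",
    explanation="The cup's handle detaches and merges back into the cup as the hand rotates it.",
    bbox_xyxy=(412, 188, 560, 330),
    onset_sec=2.4,
    offset_sec=3.1,
)
```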
Problem

Research questions and friction points this paper is trying to address.

Detecting human-perceptible deepfake traces in AI-generated videos
Creating a benchmark for spatially and temporally grounded fake artifacts
Training multimodal models to mimic human fake detection capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal LLMs detect human-perceived deepfake traces
Spatiotemporal benchmark with bounding boxes and timestamps
Reward model outperforms GPT-5 on fake clue identification, grounding, and explanation (see the scoring sketch below)
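The listing does not spell out how grounding quality is scored; below is a minimal sketch, assuming generic IoU-style overlap measures (not necessarily the paper's metric) for comparing a model's predicted box and onset/offset interval against a human annotation:

```python
def spatial_iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) pixel boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def temporal_iou(seg_a, seg_b):
    """Intersection-over-union of two (onset_sec, offset_sec) intervals."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = max(seg_a[1], seg_b[1]) - min(seg_a[0], seg_b[0])
    return inter / union if union > 0 else 0.0

# Hypothetical human annotation vs. model prediction (values are made up).
human_box, model_box = (412, 188, 560, 330), (400, 200, 550, 340)
human_seg, model_seg = (2.4, 3.1), (2.6, 3.3)
print(f"spatial IoU:  {spatial_iou(human_box, model_box):.2f}")   # ~0.75
print(f"temporal IoU: {temporal_iou(human_seg, model_seg):.2f}")  # ~0.56
```

Thresholding such overlaps (e.g., IoU >= 0.5) is a common way to turn localization into accuracy-style numbers, though the paper's exact evaluation protocol is not given in this listing.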