Beyond Real versus Fake: Towards Intent-Aware Video Analysis

📅 2025-11-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing deepfake detection methods focus predominantly on binary authenticity classification, neglecting the dissemination intent behind manipulated videos. To address this gap, we propose IntentHQ, the first benchmark and methodological framework dedicated to video intent analysis, shifting the paradigm from authenticity verification to socially contextualized understanding of motivation. We introduce the IntentHQ dataset, comprising 5,168 videos annotated with 23 fine-grained intent categories (e.g., financial fraud, political propaganda). Our approach employs a multimodal model that jointly leverages spatio-temporal visual, audio, and textual features through a combined supervised and self-supervised learning strategy, enabling cross-modal intent reasoning. Experiments demonstrate substantial improvements in identifying the latent motivations of malicious videos. IntentHQ establishes a human-centered foundation for deepfake governance, offering both a novel analytical paradigm and a scalable technical infrastructure.
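
As a rough illustration of such a pipeline, the sketch below fuses pre-extracted video, audio, and text embeddings in a shared space and classifies them into the 23 intent categories. The module names, feature dimensions, and late-fusion scheme are assumptions for illustration only, not the authors' implementation.

```python
# Hypothetical late-fusion multimodal intent classifier (PyTorch).
# Assumes per-modality embeddings have already been extracted upstream.
import torch
import torch.nn as nn

NUM_INTENTS = 23  # fine-grained intent categories in IntentHQ

class MultimodalIntentClassifier(nn.Module):
    def __init__(self, video_dim=768, audio_dim=512, text_dim=768, hidden=512):
        super().__init__()
        # Project each modality into a shared hidden space
        self.video_proj = nn.Linear(video_dim, hidden)
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.text_proj = nn.Linear(text_dim, hidden)
        # Simple fusion: concatenate projected features, then classify
        self.classifier = nn.Sequential(
            nn.Linear(3 * hidden, hidden),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden, NUM_INTENTS),
        )

    def forward(self, video_feat, audio_feat, text_feat):
        fused = torch.cat([
            self.video_proj(video_feat),
            self.audio_proj(audio_feat),
            self.text_proj(text_feat),
        ], dim=-1)
        return self.classifier(fused)  # intent logits

# Usage with dummy pre-extracted embeddings
model = MultimodalIntentClassifier()
logits = model(torch.randn(4, 768), torch.randn(4, 512), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 23])
```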

📝 Abstract
The rapid advancement of generative models has led to increasingly realistic deepfake videos, posing significant societal and security risks. While existing detection methods focus on distinguishing real from fake videos, such approaches fail to address a fundamental question: what is the intent behind a manipulated video? Towards addressing this question, we introduce IntentHQ: a new benchmark for human-centered intent analysis, shifting the paradigm from authenticity verification to contextual understanding of videos. IntentHQ consists of 5,168 videos that have been meticulously collected and annotated with 23 fine-grained intent categories, including "Financial fraud", "Indirect marketing", "Political propaganda", and "Fear mongering". We perform intent recognition with supervised and self-supervised multimodal models that integrate spatio-temporal video features, audio processing, and text analysis to infer the underlying motivations and goals behind videos. Our proposed model is designed to differentiate among a wide range of intent categories.
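
A minimal sketch of how a supervised intent loss could be combined with a self-supervised cross-modal alignment objective is shown below; the InfoNCE-style pairing of video and text embeddings and the weighting factor alpha are assumptions about such a training mix, not the paper's exact recipe.

```python
# Illustrative combination of a supervised intent loss with a
# self-supervised cross-modal alignment loss (InfoNCE-style).
import torch
import torch.nn.functional as F

def info_nce(video_emb, text_emb, temperature=0.07):
    """Contrastive alignment: matching video/text pairs should score highest."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                      # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)  # diagonal = positive pairs
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

def total_loss(intent_logits, intent_labels, video_emb, text_emb, alpha=0.5):
    # Supervised term: 23-way intent classification
    supervised = F.cross_entropy(intent_logits, intent_labels)
    # Self-supervised term: align video and text representations
    self_supervised = info_nce(video_emb, text_emb)
    return supervised + alpha * self_supervised
```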
Problem

Research questions and friction points this paper is trying to address.

Shifts focus from detecting deepfake videos to analyzing underlying intent.
Introduces a benchmark for categorizing videos into fine-grained intent types.
Uses multimodal models to infer motivations behind manipulated video content.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Shift from authenticity to intent analysis
Multimodal models integrating video, audio, and text
Benchmark annotated with fine-grained intent categories
Saurabh Atreya
BITS Pilani Hyderabad, India
Nabyl Quignon
Inria Center at Université Côte d'Azur, France
Baptiste Chopin
Inria Center at Université Côte d'Azur, France and da/sec – Biometrics and Security Research Group, Hochschule Darmstadt, Germany
Abhijit Das
BITS Pilani Hyderabad, India
Antitza Dantcheva
Research Director, Inria, France
Research interests: Video generation, Deepfake generation and detection, Face analysis for health monitoring and …