Listening Deepfake Detection: A New Perspective Beyond Speaking-Centric Forgery Analysis

📅 2026-04-14

🤖 AI Summary

This study addresses a limitation of existing deepfake detection methods, which primarily target speaking states and struggle with "listening-state" forgeries, where the manipulated subject is reacting rather than talking. To bridge this gap, the work introduces the task of listening deepfake detection, presents ListenForge, the first dataset dedicated to this task and built with five listening head generation methods, and proposes MANet, a motion-aware and audio-guided network. MANet captures subtle motion inconsistencies in listener videos and uses the speaker's audio semantics to guide cross-modal fusion. Experiments show that current speaking-state detectors degrade significantly in listening scenarios, whereas MANet substantially outperforms baseline approaches on ListenForge. The work thus moves beyond the traditional speaking-centric detection paradigm and extends multimodal deepfake analysis to interactive communication settings.

📝 Abstract

Existing deepfake detection research has primarily focused on scenarios where the manipulated subject is actively speaking, i.e., generating fabricated content by altering the speaker's appearance or voice. However, in realistic interaction settings, attackers often alternate between falsifying speaking and listening states to mislead their targets, thereby enhancing the realism and persuasiveness of the scenario. Although the detection of 'listening deepfakes' remains largely unexplored and is hindered by a scarcity of both datasets and methodologies, the relatively limited quality of synthesized listening reactions presents an excellent breakthrough opportunity for current deepfake detection efforts. In this paper, we present the task of Listening Deepfake Detection (LDD). We introduce ListenForge, the first dataset specifically designed for this task, constructed using five Listening Head Generation (LHG) methods. To address the distinctive characteristics of listening forgeries, we propose MANet, a Motion-aware and Audio-guided Network that captures subtle motion inconsistencies in listener videos while leveraging the speaker's audio semantics to guide cross-modal fusion. Extensive experiments demonstrate that existing Speaking Deepfake Detection (SDD) models perform poorly in listening scenarios. In contrast, MANet achieves significantly superior performance on ListenForge. Our work highlights the necessity of rethinking deepfake detection beyond the traditional speaking-centric paradigm and opens new directions for multimodal forgery analysis in interactive communication settings. The dataset and code are available at https://anonymous.4open.science/r/LDD-B4CB.

Problem

Research questions and friction points this paper is trying to address.

Listening Deepfake Detection

Deepfake

Multimodal Forgery

Interactive Communication

Non-speaking State

Innovation

Methods, ideas, or system contributions that make the work stand out.

Listening Deepfake Detection

ListenForge

Motion-aware Network