AuViRe: Audio-visual Speech Representation Reconstruction for Deepfake Temporal Localization

πŸ“… 2025-11-24
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
To address the challenge of temporally precise localization of tampered segments in deepfake videos, this paper proposes a detection method based on cross-modal speech representation reconstruction. The core contribution is the first use of a bidirectional audio–lip reconstruction mechanism for deepfake temporal localization: it exploits the reconstruction discrepancies that arise from audio-visual semantic inconsistency in forged regions to heighten sensitivity to subtle manipulations. The method jointly models deep semantic features from the audio and visual streams, reconstructing each modality's speech representation from the other and localizing forgeries at the frame level via the reconstruction error. Extensive experiments demonstrate significant improvements over the state of the art: +8.9 AP@0.95 on LAV-DF, +9.6 AP@0.5 on AV-Deepfake1M, and +5.1 AUC in an in-the-wild experiment.
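
As a rough illustration of the core idea, below is a minimal PyTorch sketch. All module names, the transformer heads, and the hyperparameters are hypothetical stand-ins rather than the paper's actual architecture: two reconstruction heads map temporally aligned audio and lip feature sequences onto each other, and the per-frame reconstruction error is the cue that should spike on forged frames.

```python
import torch
import torch.nn as nn

class CrossModalReconstructor(nn.Module):
    """Reconstruct each modality's per-frame speech features from the other."""

    def __init__(self, dim: int = 256):
        super().__init__()
        # Hypothetical lightweight transformer heads; the paper's actual
        # backbone and feature extractors may differ.
        self.audio_to_lip = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.lip_to_audio = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )

    def forward(self, audio_feats: torch.Tensor, lip_feats: torch.Tensor):
        # Inputs: (batch, frames, dim), temporally aligned speech features.
        lip_hat = self.audio_to_lip(audio_feats)    # predict lip features from audio
        audio_hat = self.lip_to_audio(lip_feats)    # predict audio features from lips
        # Per-frame squared error; forged frames should reconstruct poorly.
        frame_err = ((lip_hat - lip_feats) ** 2).mean(dim=-1) \
                  + ((audio_hat - audio_feats) ** 2).mean(dim=-1)
        return frame_err                            # (batch, frames)

# Usage on random features (stand-ins for real audio/lip embeddings):
model = CrossModalReconstructor()
err = model(torch.randn(1, 100, 256), torch.randn(1, 100, 256))
print(err.shape)  # torch.Size([1, 100])
```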

πŸ“ Abstract
With the rapid advancement of sophisticated synthetic audio-visual content enabling, e.g., subtle malicious manipulations, ensuring the integrity of digital media has become paramount. This work presents a novel approach to temporal localization of deepfakes by leveraging Audio-Visual Speech Representation Reconstruction (AuViRe). Specifically, our approach reconstructs speech representations from one modality (e.g., lip movements) based on the other (e.g., audio waveform). Cross-modal reconstruction is significantly more challenging in manipulated video segments, leading to amplified discrepancies, thereby providing robust discriminative cues for precise temporal forgery localization. AuViRe outperforms the state of the art by +8.9 AP@0.95 on LAV-DF, +9.6 AP@0.5 on AV-Deepfake1M, and +5.1 AUC on an in-the-wild experiment. Code available at https://github.com/mever-team/auvire.
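
This summary does not specify which feature extractors the paper uses. As one plausible stand-in for the audio side, per-frame speech representations can be pulled from a pretrained wav2vec 2.0 model via torchaudio; the choice of model and layer below is an assumption, not the paper's:

```python
import torch
import torchaudio

# Stand-in audio speech-representation extractor (wav2vec 2.0 base).
bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()

waveform = torch.randn(1, bundle.sample_rate)  # 1 s of dummy 16 kHz audio
with torch.inference_mode():
    features, _ = model.extract_features(waveform)
# features is a list of per-layer tensors shaped (batch, frames, dim);
# the last layer is one reasonable choice to feed reconstruction heads
# (after projecting to their input dimension).
audio_feats = features[-1]
print(audio_feats.shape)  # roughly (1, 49, 768) at ~50 feature frames/s
```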
Problem

Research questions and friction points this paper is trying to address.

Detecting temporal deepfake manipulations in audio-visual content
Reconstructing speech representations across audio and visual modalities
Localizing forged segments by analyzing cross-modal reconstruction discrepancies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reconstructs speech representations across audio-visual modalities
Amplifies discrepancies in manipulated video segments
Enables precise temporal localization of deepfakes from per-frame reconstruction error (see the sketch after this list)
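
To make the last point concrete, here is a hypothetical post-processing step that turns per-frame reconstruction errors into temporal forgery segments; the threshold, frame rate, and function name are illustrative only, not values from the paper:

```python
import numpy as np

def errors_to_segments(frame_err: np.ndarray, fps: float = 25.0,
                       threshold: float = 0.5):
    """Return (start_sec, end_sec) spans where the error exceeds the threshold."""
    flagged = frame_err > threshold
    segments, start = [], None
    for i, f in enumerate(flagged):
        if f and start is None:
            start = i                              # segment opens
        elif not f and start is not None:
            segments.append((start / fps, i / fps))  # segment closes
            start = None
    if start is not None:                          # segment runs to the end
        segments.append((start / fps, len(flagged) / fps))
    return segments

# Example: frames 3-5 exceed the threshold -> one segment at 0.12-0.24 s.
print(errors_to_segments(np.array([0.1, 0.2, 0.1, 0.9, 0.8, 0.7, 0.1])))
```

In practice one would likely smooth the error curve and calibrate the threshold on a validation set before grouping frames into segments.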