🤖 AI Summary
To address the challenge of temporally precise localization of tampered segments in deepfake videos, this paper proposes a detection method based on cross-modal speech representation reconstruction. The core contribution is the first use of a bidirectional audio–lip reconstruction mechanism for deepfake temporal localization: it exploits the reconstruction discrepancies that arise from audio-visual semantic inconsistency in forged regions to heighten sensitivity to subtle manipulations. The method employs a contrastive reconstruction strategy that jointly models deep semantic features from the audio and visual streams, enabling frame-level localization via reconstruction error. Extensive experiments demonstrate significant improvements over state-of-the-art methods: +8.9 AP@0.95 on LAV-DF, +9.6 AP@0.5 on AV-Deepfake1M, and +5.1 AUC in an in-the-wild setting.
📝 Abstract
With the rapid advancement of sophisticated synthetic audio-visual content, which enables subtle malicious manipulations, ensuring the integrity of digital media has become paramount. This work presents a novel approach to temporal localization of deepfakes by leveraging Audio-Visual Speech Representation Reconstruction (AuViRe). Specifically, our approach reconstructs speech representations of one modality (e.g., lip movements) from the other (e.g., the audio waveform). Cross-modal reconstruction is significantly more challenging in manipulated video segments, leading to amplified discrepancies and thereby providing robust discriminative cues for precise temporal forgery localization. AuViRe outperforms the state of the art by +8.9 AP@0.95 on LAV-DF, +9.6 AP@0.5 on AV-Deepfake1M, and +5.1 AUC in an in-the-wild experiment. Code is available at https://github.com/mever-team/auvire.
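The core intuition, reconstructing one modality's representation from the other and flagging frames where the reconstruction error spikes, can be illustrated with a minimal toy sketch. Everything here is an assumption for illustration: random vectors stand in for the learned audio/lip encoders, a linear least-squares map stands in for the paper's reconstruction network, and the median + MAD threshold is a simple heuristic, not the method's actual decision rule.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for per-frame deep speech representations (the real method
# uses learned audio and lip encoders; here the features are synthetic).
T, d_audio, d_lip = 200, 16, 12
W_true = rng.normal(size=(d_audio, d_lip))
audio = rng.normal(size=(T, d_audio))
# Genuine frames: the lip representation is consistent with the audio.
lip = audio @ W_true + 0.05 * rng.normal(size=(T, d_lip))

# Simulate a tampered segment: frames 80-119 get lip features decoupled
# from the audio (the audio-visual semantic inconsistency being exploited).
lip[80:120] = rng.normal(size=(40, d_lip))

# Fit a simple audio -> lip reconstructor on frames assumed genuine
# (linear least squares stands in for a learned reconstruction network).
train = np.r_[0:80, 120:200]
W, *_ = np.linalg.lstsq(audio[train], lip[train], rcond=None)

# Frame-level reconstruction error: large wherever the modalities disagree.
err = np.linalg.norm(audio @ W - lip, axis=1)

# Robust threshold (median + k * MAD, a toy heuristic); tampered frames
# should stand out sharply against the genuine ones.
mad = np.median(np.abs(err - np.median(err)))
flagged = np.flatnonzero(err > np.median(err) + 8 * mad)
# flagged indices should concentrate in the tampered 80-119 range
```

In the actual method the reconstruction runs bidirectionally (audio to lip and lip to audio) over deep semantic features, so the discrepancy signal is far richer than this linear toy, but the localization principle, thresholding per-frame reconstruction error, is the same.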