🤖 AI Summary
To address the irrecoverability of authentic audio and the difficulty of localizing tampered regions in synthetic audio-visual forgeries (SAVFs), this paper proposes the first cross-modal watermarking framework: it encodes the original audio into a robust visual watermark embedded directly into video frames, enabling semantic-level audio reconstruction and pixel-level tampering localization under forgery. Methodologically, we unify audio recovery and tampering detection within a single end-to-end differentiable watermark encoder-decoder architecture, augmented by a discrepancy-aware comparison mechanism. Our approach achieves high-fidelity audio reconstruction (PESQ > 3.2) and fine-grained tampering localization (mAP@0.5 > 91%). Extensive evaluations across diverse speech cloning and lip-sync forgery scenarios demonstrate significant improvements over state-of-the-art baselines. This work establishes a verifiable and traceable paradigm for defending against audio-visual deepfakes.
📝 Abstract
Recent advances in voice cloning and lip synchronization models have enabled Synthesized Audiovisual Forgeries (SAVFs), where both audio and visuals are manipulated to mimic a target speaker. This significantly increases the risk of misinformation by making fake content seem real. To address this issue, existing methods detect or localize manipulations but cannot recover the authentic audio that conveys the semantic content of the message. This limitation reduces their effectiveness in combating audiovisual misinformation. In this work, we introduce the task of Authentic Audio Recovery (AAR) and Tamper Localization in Audio (TLA) from SAVFs and propose a cross-modal watermarking framework to embed authentic audio into visuals before manipulation. This enables AAR, TLA, and a robust defense against misinformation. Extensive experiments demonstrate the strong performance of our method in AAR and TLA against various manipulations, including voice cloning and lip synchronization.