Cross-Modal Watermarking for Authentic Audio Recovery and Tamper Localization in Synthesized Audiovisual Forgeries

📅 2025-07-16

📈 Citations: 0

✨ Influential: 0

career value

210K/year

🤖 AI Summary

To address the irrecoverability of authentic audio and the difficulty of localizing tampered regions in synthetic audio-visual forgeries (SAVFs), this paper proposes the first cross-modal watermarking framework: it encodes the original audio into a robust visual watermark embedded directly into video frames, enabling semantic-level audio reconstruction and pixel-level tampering localization under forgery. Methodologically, we unify audio recovery and tampering detection within a single end-to-end differentiable watermark encoder-decoder architecture, augmented by a discrepancy-aware comparison mechanism. Our approach achieves high-fidelity audio reconstruction (PESQ > 3.2) and fine-grained tampering localization (mAP@0.5 > 91%). Extensive evaluations across diverse speech cloning and lip-sync forgery scenarios demonstrate significant improvements over state-of-the-art baselines. This work establishes a verifiable and traceable paradigm for defending against audio-visual deepfakes.

Technology Category

Application Category

📝 Abstract

Recent advances in voice cloning and lip synchronization models have enabled Synthesized Audiovisual Forgeries (SAVFs), where both audio and visuals are manipulated to mimic a target speaker. This significantly increases the risk of misinformation by making fake content seem real. To address this issue, existing methods detect or localize manipulations but cannot recover the authentic audio that conveys the semantic content of the message. This limitation reduces their effectiveness in combating audiovisual misinformation. In this work, we introduce the task of Authentic Audio Recovery (AAR) and Tamper Localization in Audio (TLA) from SAVFs and propose a cross-modal watermarking framework to embed authentic audio into visuals before manipulation. This enables AAR, TLA, and a robust defense against misinformation. Extensive experiments demonstrate the strong performance of our method in AAR and TLA against various manipulations, including voice cloning and lip synchronization.

Problem

Research questions and friction points this paper is trying to address.

Detect and recover authentic audio in synthesized forgeries

Localize tampered audio segments in manipulated content

Defend against misinformation from voice and lip sync fakes

Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-modal watermarking for audio recovery

Embed authentic audio into visuals

Robust defense against audiovisual forgeries

🔎 Similar Papers

No similar papers found.