mAVE: A Watermark for Joint Audio-Visual Generation Models

📅 2026-03-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a weakness of existing watermarking methods for joint audio-visual generation models. Because they watermark and verify each modality independently, they are vulnerable to Swap Attacks, in which an adversary replaces the authentic audio while keeping the watermarked video. The forged content is then falsely authenticated, compromising copyright integrity and content trustworthiness. To counter this, the authors propose mAVE (Manifold Audio-Visual Entanglement), a watermarking framework designed natively for joint generation models. mAVE combines Inverse Transform Sampling with cryptographic binding when the latents are initialized, so that legitimate audio and video latents lie on a shared entangled manifold in latent space and are strongly coupled across modalities. The method requires no fine-tuning and embeds the watermark without degrading generation quality on state-of-the-art models such as LTX-2 and MOVA. The authors report binding integrity above 99% and an exponential security bound against Swap Attacks, protecting both content provenance and vendor copyright.
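
The claim of embedding without performance loss is consistent with a standard property of inverse transform sampling, sketched here for a standard normal latent prior. The prior and the symbols $U$, $Z$, and $\Phi$ are assumptions for illustration; the summary does not state how mAVE derives its uniform draws. If $U$ is uniform on $(0,1)$ and $\Phi$ is the standard normal CDF, then $Z = \Phi^{-1}(U)$ has exactly the Gaussian distribution:

```latex
P(Z \le z) = P\bigl(\Phi^{-1}(U) \le z\bigr) = P\bigl(U \le \Phi(z)\bigr) = \Phi(z)
```

Under that assumption, any keyed process whose output is indistinguishable from uniform yields initial latents that follow the same distribution as unwatermarked noise.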

📝 Abstract

As Joint Audio-Visual Generation Models see widespread commercial deployment, embedding watermarks has become essential for protecting vendor copyright and ensuring content provenance. However, existing techniques suffer from an architectural mismatch by treating modalities as decoupled entities, exposing a critical Binding Vulnerability. Adversaries exploit this via Swap Attacks by replacing authentic audio with malicious deepfakes while retaining the watermarked video. Because current detectors rely on independent verification ($Video_{wm}\vee Audio_{wm}$), they incorrectly authenticate the manipulated content, falsely attributing harmful media to the original vendor and severely damaging their reputation. To address this, we propose mAVE (Manifold Audio-Visual Entanglement), the first watermarking framework natively designed for joint architectures. mAVE cryptographically binds audio and video latents at initialization without fine-tuning, defining a Legitimate Entanglement Manifold via Inverse Transform Sampling. Experiments on state-of-the-art models (LTX-2, MOVA) demonstrate that mAVE guarantees performance-losslessness and provides an exponential security bound against Swap Attacks. Achieving near-perfect binding integrity ($>99\%$), mAVE offers a robust cryptographic defense for vendor copyright.
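
To make the binding idea concrete, the following is a minimal toy sketch, not the paper's construction. It assumes a standard normal latent prior, a 64-bit payload, and an HMAC-SHA256 tag as the binding; the names `embed`, `recover`, `bound_audio_bits`, `generate_pair`, and `verify_pair` are hypothetical. It also skips generation and latent inversion entirely, working on the initial latents directly. Each bit is embedded by inverse transform sampling, and the audio bits are derived from the video bits under a secret key, so the detector checks the pair jointly rather than each modality on its own.

```python
import hashlib
import hmac
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal prior (assumption)
N_BITS = 64                # toy payload length (assumption)


def embed(bits, rng):
    """Inverse transform sampling: bit b selects the half [b/2, (b+1)/2) of
    the unit interval; the inverse normal CDF maps the draw to a latent."""
    latents = []
    for b in bits:
        u = min(max(rng.random(), 1e-12), 1 - 1e-12)  # keep p inside (0, 1)
        latents.append(STD_NORMAL.inv_cdf((b + u) / 2))
    return latents


def recover(latents):
    """Read each bit back from the sign of its latent."""
    return [1 if z > 0 else 0 for z in latents]


def bound_audio_bits(key, video_bits):
    """Hypothetical binding: audio bits are an HMAC tag over the video bits."""
    tag = hmac.new(key, bytes(video_bits), hashlib.sha256).digest()
    return [(tag[i // 8] >> (i % 8)) & 1 for i in range(N_BITS)]


def generate_pair(key, rng):
    """Draw initial video and audio latents that carry a bound watermark."""
    video_bits = [rng.getrandbits(1) for _ in range(N_BITS)]
    audio_bits = bound_audio_bits(key, video_bits)
    return embed(video_bits, rng), embed(audio_bits, rng)


def verify_pair(key, video_latents, audio_latents):
    """Accept only if the audio carries the tag derived from this video."""
    expected = bound_audio_bits(key, recover(video_latents))
    return recover(audio_latents) == expected


rng = random.Random(0)
key = b"vendor-secret-key"  # placeholder key for the example
video_a, audio_a = generate_pair(key, rng)
video_b, audio_b = generate_pair(key, rng)
print(verify_pair(key, video_a, audio_a))  # matched pair
print(verify_pair(key, video_a, audio_b))  # audio swapped in from another sample
```

In this toy setting, a swapped audio track passes only if its 64 recovered bits happen to equal the tag for the retained video, which occurs with probability about $2^{-64}$ for an attacker without the key. That is one simple way a security margin can grow exponentially with payload length; the paper's actual bound is not given in the abstract.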

Problem

Research questions and friction points this paper is trying to address.

Joint Audio-Visual Generation

Watermarking

Binding Vulnerability

Swap Attacks

Content Provenance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Audio-Visual Watermarking

Manifold Entanglement

Swap Attack Resistance