🤖 AI Summary
Speech-driven, photorealistic talking-head generation poses significant misuse risks, and existing unimodal defense methods struggle to suppress it effectively. This work proposes a stage-aware multimodal adversarial defense framework that jointly perturbs portrait and audio inputs to disrupt lip-sync coherence and facial dynamics while preserving perceptual quality. Key innovations include multi-interval sampling for static-image guidance, a cross-attention deception mechanism that suppresses audio-conditioned responses, and modality-specific perceptual constraints optimized across diffusion-model stages. Experiments show that the proposed method substantially outperforms unimodal baselines in white-box active defense settings, effectively degrading temporal consistency in generated videos while remaining robust against purification attacks.
📝 Abstract
Diffusion-based audio-driven talking-head generation enables realistic portrait animation, but also introduces risks of misuse, such as fraud and misinformation. Existing protection methods are largely limited to a single modality, and neither image-only nor audio-only attacks can effectively suppress speech-driven facial dynamics. To address this gap, we propose SyncBreaker, a stage-aware multimodal protection framework that jointly perturbs portrait and audio inputs under modality-specific perceptual constraints. Our key contributions are twofold. First, for the image stream, we introduce nullifying supervision with Multi-Interval Sampling (MIS) across diffusion stages to steer the generation toward the static reference portrait by aggregating guidance from multiple denoising intervals. Second, for the audio stream, we propose Cross-Attention Fooling (CAF), which suppresses interval-specific audio-conditioned cross-attention responses. Both streams are optimized independently and combined at inference time to enable flexible deployment. We evaluate SyncBreaker in a white-box proactive protection setting. Extensive experiments demonstrate that SyncBreaker more effectively degrades lip synchronization and facial dynamics than strong single-modality baselines, while preserving input perceptual quality and remaining robust under purification. Code: https://github.com/kitty384/SyncBreaker.
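The core idea of jointly perturbing both inputs under separate perceptual budgets can be illustrated with a toy sketch. This is not the paper's implementation: the bilinear `sync_loss` and coupling matrix `W` are stand-ins for the generator's audio-visual alignment, and the budgets `eps_img`/`eps_aud` are illustrative placeholders for the modality-specific perceptual constraints.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))  # toy cross-modal coupling (assumed, not from the paper)

def sync_loss(img, aud):
    """Toy audio-visual alignment score; the defense tries to minimize it."""
    return float(img @ W @ aud)

def grad_sync(img, aud):
    """Analytic gradients of the toy loss w.r.t. each modality."""
    return W @ aud, W.T @ img

def pgd_multimodal(img, aud, eps_img=0.03, eps_aud=0.1, steps=20, lr=0.01):
    """Signed-gradient descent on the sync loss, with each modality's
    perturbation clipped to its own L-inf budget (modality-specific
    constraint). The two streams could also be optimized independently
    and combined at inference, as the abstract describes."""
    d_img = np.zeros_like(img)
    d_aud = np.zeros_like(aud)
    for _ in range(steps):
        g_img, g_aud = grad_sync(img + d_img, aud + d_aud)
        d_img = np.clip(d_img - lr * np.sign(g_img), -eps_img, eps_img)
        d_aud = np.clip(d_aud - lr * np.sign(g_aud), -eps_aud, eps_aud)
    return d_img, d_aud

img = rng.normal(size=8)
aud = rng.normal(size=8)
d_img, d_aud = pgd_multimodal(img, aud)
before = sync_loss(img, aud)
after = sync_loss(img + d_img, aud + d_aud)
print(before, after)  # the joint perturbation should lower the toy sync loss
```

In the actual framework, the image-stream objective would instead aggregate nullifying guidance over multiple denoising intervals (MIS), and the audio-stream objective would target cross-attention maps (CAF); the sketch only captures the shared joint-perturbation-under-budget structure.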