Efficient and Robust Video Defense Framework against 3D-field Personalized Talking Face

📅 2025-12-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
3D-field-driven personalized talking face generation (TFG) poses severe privacy threats to portrait videos, yet existing image-level defense methods suffer from high computational overhead, significant quality degradation, and insufficient disruption of critical 3D information. Method: We propose the first efficient and robust video-level defense framework, uniquely targeting perturbation of the 3D geometry and appearance modeling process. Our approach introduces a similarity-guided parameter-sharing mechanism and a multi-scale dual-domain attention module for joint spatial-frequency optimization, coupled with lightweight 3D-aware perturbations and adversarial robustness enhancement strategies. Contribution/Results: Experiments demonstrate substantial improvements in defense success rate, with inference speed accelerated by 47× over the fastest baseline. The framework exhibits strong robustness against geometric scaling and state-of-the-art purification attacks. Ablation studies confirm the distinct and complementary contributions of each component.

📝 Abstract
State-of-the-art 3D-field video-referenced Talking Face Generation (TFG) methods synthesize high-fidelity personalized talking-face videos in real time by modeling 3D geometry and appearance from a reference portrait video. This capability raises significant privacy concerns about malicious misuse of personal portraits, yet no efficient defense framework exists to protect such videos against 3D-field TFG methods. While image-based defenses could apply per-frame 2D perturbations, they incur prohibitive computational costs and severe video quality degradation, and they fail to disrupt the 3D information that video protection requires. To address this, we propose a novel and efficient video defense framework against 3D-field TFG methods, which protects portrait videos by perturbing the 3D information acquisition process while maintaining high-fidelity video quality. Specifically, our method introduces: (1) a similarity-guided parameter-sharing mechanism for computational efficiency, and (2) a multi-scale dual-domain attention module to jointly optimize spatial-frequency perturbations. Extensive experiments demonstrate that our framework exhibits strong defense capability and achieves a 47× acceleration over the fastest baseline while maintaining high fidelity. Moreover, it remains robust against scaling operations and state-of-the-art purification attacks, and ablation studies further validate our design choices. Our project is available at https://github.com/Richen7418/VDF.
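The dual-domain idea in the abstract, perturbing a frame in both pixel space and frequency space under separate budgets, can be illustrated with a minimal sketch. This is a hypothetical toy, not the paper's attention module: the function `dual_domain_perturb` and the budgets `eps_spatial`/`eps_freq` are illustrative assumptions, and random noise stands in for the optimized perturbation.

```python
import numpy as np

def dual_domain_perturb(frame, eps_spatial=2.0, eps_freq=0.05, seed=0):
    """Apply a small perturbation in both the pixel (spatial) and
    Fourier (frequency) domains, each clipped to its own budget.
    `frame` is a 2D float array with values in [0, 255]."""
    rng = np.random.default_rng(seed)
    # Spatial component: bounded additive noise in pixel space.
    delta_s = np.clip(rng.normal(0, 1, frame.shape), -1, 1) * eps_spatial
    # Frequency component: multiplicatively jitter the FFT coefficients
    # by a bounded factor, then transform back to pixel space.
    spec = np.fft.fft2(frame)
    delta_f = np.clip(rng.normal(0, 1, frame.shape), -1, 1) * eps_freq
    frame_freq = np.real(np.fft.ifft2(spec * (1.0 + delta_f)))
    # Combine both components and keep the result in valid pixel range.
    return np.clip(frame_freq + delta_s, 0, 255)
```

In an actual defense, the two noise tensors would be optimized (e.g. by gradient ascent against the TFG model's 3D reconstruction loss) rather than sampled randomly; the sketch only shows how two budgeted perturbations in different domains compose into one protected frame.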
Problem

Research questions and friction points this paper is trying to address.

Defend portrait videos against 3D talking face generation misuse
Protect 3D information in videos while preserving high quality
Achieve efficient defense with minimal computational cost and degradation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Perturbing 3D information acquisition process for video protection
Similarity-guided parameter sharing mechanism for computational efficiency
Multi-scale dual-domain attention module for spatial-frequency perturbations
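The similarity-guided parameter-sharing idea above amortizes perturbation cost across near-duplicate frames: frames that barely change between timesteps can share one perturbation instead of each being optimized from scratch. A minimal, hypothetical sketch of such grouping (the cosine-similarity criterion, the threshold `tau`, and the function names are illustrative assumptions, not the paper's actual mechanism):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two frames, flattened to vectors."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def group_frames_by_similarity(frames, tau=0.95):
    """Greedily keep appending frames to the current group while they
    stay similar to the group's first (anchor) frame; otherwise start a
    new group. All frames in a group can then share one perturbation."""
    groups = [[0]]
    anchor = frames[0]
    for i in range(1, len(frames)):
        if cosine_sim(frames[i], anchor) >= tau:
            groups[-1].append(i)
        else:
            groups.append([i])
            anchor = frames[i]
    return groups
```

With a video of T frames partitioned into G groups, only G perturbations need optimizing instead of T, which is one plausible source of the reported speedup when consecutive frames are highly redundant.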
👥 Authors

Rui-qing Sun · Beijing Institute of Technology
Xingshan Yao · Beijing Institute of Technology
Tian Lan · Alibaba International Digital Commerce
Hui-Yang Zhao · Beijing Institute of Technology
Jia-Ling Shi · Beijing Institute of Technology
Chen-Hao Cui · Beijing Institute of Technology
Zhijing Wu · Beijing Institute of Technology (Information Retrieval, Natural Language Processing)
Chen Yang · Beijing Institute of Technology
Xian-Ling Mao · Beijing Institute of Technology (Web Data Mining, Information Extraction, QA & Dialogue, Topic Modeling, Learning to Hash)