RPRA-ADD: Forgery Trace Enhancement-Driven Audio Deepfake Detection

📅 2025-05-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address poor generalization and insufficient perception of forgery traces in audio deepfake detection, this paper proposes an end-to-end detection framework driven jointly by reconstruction, perception, reinforcement, and attention. It introduces a global-local forgery perception module, a multi-stage dispersed enhancement loss, and a dynamic attention mechanism that focuses on forgery traces via a reconstruction-difference matrix. The framework integrates a self-supervised reconstruction network with dual-scale feature modeling, significantly improving robustness to unseen spoofing techniques and cross-acoustic-domain conditions. Evaluated on four major benchmarks (ASVspoof2019, ASVspoof2021, CodecFake, and FakeSound), the method achieves state-of-the-art performance, with average improvements exceeding 20% over prior work. In rigorous 3×3 cross-domain evaluations across speech, sound, and singing, it outperforms all existing approaches.
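The reconstruction-difference-driven attention described above can be pictured as a standard attention step whose logits are biased by per-frame reconstruction error. The following is a minimal PyTorch sketch of that idea only; the additive-bias form, the per-frame L2 reduction, and the `alpha` scale are assumptions for illustration, not the paper's exact formulation.

```python
import torch


def fake_trace_focused_attention(feats, recon, q, k, v, alpha=1.0):
    """Bias attention toward frames the reconstruction network handles poorly.

    feats, recon: (B, T, D) original and reconstructed features.
    q, k, v:      (B, T, D) standard attention inputs.
    alpha:        hypothetical scale for the discrepancy bias (assumption).
    """
    # Reconstruction-discrepancy matrix, reduced to a per-frame error (B, T).
    disc = (feats - recon).pow(2).mean(dim=-1)
    # Scaled dot-product attention logits (B, T, T).
    logits = q @ k.transpose(1, 2) / k.shape[-1] ** 0.5
    # Additive bias on the key axis: frames with large reconstruction error
    # (candidate forgery traces) receive more attention mass.
    weights = torch.softmax(logits + alpha * disc.unsqueeze(1), dim=-1)
    return weights @ v
```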

📝 Abstract
Existing methods for audio deepfake detection have demonstrated some effectiveness. However, they still struggle to generalize to new forgery techniques and evolving attack patterns. This limitation arises mainly because the models rely heavily on the distribution of the training data and fail to learn a decision boundary that captures the essential characteristics of forgeries. Additionally, relying solely on a classification loss makes it difficult to capture the intrinsic differences between real and fake audio. In this paper, we propose RPRA-ADD, an integrated Reconstruction-Perception-Reinforcement-Attention network for robust, forgery-trace-enhancement-driven audio deepfake detection. First, we propose a Global-Local Forgery Perception (GLFP) module to enhance the acoustic perception of forgery traces. To reinforce the feature-space distribution differences between real and fake audio, we design the Multi-stage Dispersed Enhancement Loss (MDEL), which applies a dispersal strategy across multi-stage feature spaces. Furthermore, to sharpen feature awareness of forgery traces, we introduce the Fake Trace Focused Attention (FTFA) mechanism, which adjusts attention weights dynamically according to the reconstruction discrepancy matrix. Visualization experiments demonstrate that FTFA not only improves attention to voice segments but also enhances generalization. Experimental results show that the proposed method achieves state-of-the-art performance on four benchmark datasets (ASVspoof2019, ASVspoof2021, CodecFake, and FakeSound), with over 20% performance improvement. In addition, it outperforms existing methods in rigorous 3×3 cross-domain evaluations across Speech, Sound, and Singing, demonstrating strong generalization across diverse audio domains.
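The abstract describes MDEL as a dispersal strategy applied in multi-stage feature spaces. One plausible realization is a margin loss on class centroids summed over the tapped stages; the sketch below takes that reading. The centroid-margin form and the `margin` value are assumptions, not the paper's published loss.

```python
import torch
import torch.nn.functional as F


def mdel(stage_feats, labels, margin=1.0):
    """Hinge the distance between real and fake centroids at every stage.

    stage_feats: list of (B, D_i) embeddings tapped from several stages.
    labels:      (B,) tensor, 0 = real, 1 = fake.
    margin:      hypothetical minimum centroid separation (assumption).
    """
    total = stage_feats[0].new_zeros(())
    for feats in stage_feats:
        feats = F.normalize(feats, dim=-1)
        real, fake = feats[labels == 0], feats[labels == 1]
        if real.numel() == 0 or fake.numel() == 0:
            continue  # batch lacks one class; skip this stage
        # Push the two class distributions at least `margin` apart,
        # dispersing real and fake throughout the feature hierarchy.
        gap = (real.mean(0) - fake.mean(0)).norm()
        total = total + F.relu(margin - gap)
    return total / len(stage_feats)
```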
Problem

Research questions and friction points this paper is trying to address.

Enhancing detection of evolving audio deepfake techniques
Improving generalization across diverse forgery patterns
Strengthening feature-space differences between real and fake audio
Innovation

Methods, ideas, or system contributions that make the work stand out.

Global-Local Forgery Perception module enhances acoustic perception of forgery traces (see the sketch after this list)
Multi-stage Dispersed Enhancement Loss reinforces feature-space differences across stages
Fake Trace Focused Attention reweights attention dynamically from the reconstruction-discrepancy matrix
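The GLFP module's dual-scale modeling can be sketched as two branches over the same feature sequence: a global self-attention branch for long-range context and a local convolutional branch for short-range artifacts, fused per frame. The branch choices, sizes, and fusion step below are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn


class GLFPSketch(nn.Module):
    """Hypothetical dual-scale layout for global-local forgery perception.
    Branch and fusion choices are assumptions for illustration."""

    def __init__(self, dim=256, heads=4, kernel=3):
        super().__init__()
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.local_conv = nn.Conv1d(dim, dim, kernel, padding=kernel // 2)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, x):  # x: (B, T, D)
        g, _ = self.global_attn(x, x, x)                        # long-range forgery cues
        l = self.local_conv(x.transpose(1, 2)).transpose(1, 2)  # frame-level artifacts
        return self.fuse(torch.cat([g, l], dim=-1))
```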
👥 Authors

Ruibo Fu
Associate Professor, CASIA
AIGC, LMM, Intelligent speech interaction, Deepfake detection

Xiaopeng Wang
Institute of Automation, Chinese Academy of Sciences
Fake Audio Detection, Text To Speech, Speech Large Model

Zhengqi Wen
Tsinghua University
LLM

Jianhua Tao
Department of Automation, Tsinghua University, Beijing 100084, China

Yuankun Xie
PhD Candidate, Communication University of China
Audio Deepfake Detection, Domain Generalization, Out-of-Distribution Detection, Neural Audio Codec

Zhiyong Wang
School of Artificial Intelligence, Chinese Academy of Sciences, Beijing, China

Chunyu Qiang
Kuaishou Technology; TJU; CASIA
Speech Synthesis

Xuefei Liu
Institute of Automation, Chinese Academy of Sciences, Beijing, China

Cunhang Fan
Anhui Province Key Laboratory of Multimodal Cognitive Computation, School of Computer Science and Technology, Anhui University, Hefei 230601, China

Chenxing Li
Tencent AI Lab, Beijing, China

Guanjun Li
Institute of Automation, Chinese Academy of Sciences
Audio Processing, Audio-visual Learning