BreathNet: Generalizable Audio Deepfake Detection via Breath-Cue-Guided Feature Refinement

📅 2026-02-14
🤖 AI Summary
This work addresses the limited generalization of current audio deepfake detection methods, which struggle to exploit fine-grained physiological cues against increasingly realistic synthetic speech. The authors propose BreathNet, a framework that explicitly incorporates breath sounds as a physiological prior. It introduces a BreathFiLM module that dynamically modulates time-series features extracted by XLS-R according to the presence or absence of breath, while also integrating spectral features to capture vocoder artifacts. A multi-task feature learning strategy combines a positive-only supervised contrastive loss (PSCL), center loss, and a contrastive loss to enhance separability between genuine and spoofed samples in the feature space. The method achieves state-of-the-art performance across five benchmark datasets: an average EER of 1.99% on four evaluation sets under the ASVspoof 2019 LA protocol, 4.70% EER on the In-the-Wild dataset, and 4.94% EER under the latest ASVspoof5 protocol.
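The BreathFiLM module described above applies feature-wise linear modulation (FiLM) to XLS-R temporal features conditioned on breath presence. A minimal numpy sketch of the FiLM operation follows; the conditioning signal, parameter names, and shapes here are illustrative assumptions, not the paper's implementation (in practice the scale/shift generators would be learned jointly with the XLS-R extractor).

```python
import numpy as np

def film_modulate(features, breath_prob, w_gamma, b_gamma, w_beta, b_beta):
    """FiLM: per-channel scale (gamma) and shift (beta) of a (time, channels)
    feature map, conditioned on a per-frame breath-presence score.
    All conditioning weights here are hypothetical stand-ins for learned layers."""
    gamma = breath_prob * w_gamma + b_gamma   # (time, channels) scale
    beta = breath_prob * w_beta + b_beta      # (time, channels) shift
    return features * gamma + beta

rng = np.random.default_rng(0)
T, C = 4, 8                                   # toy: 4 frames, 8 channels
feats = rng.standard_normal((T, C))           # stand-in for XLS-R features
# Hypothetical conditioning: scalar breath-presence probability per frame.
breath = np.array([[0.0], [0.9], [0.1], [0.8]])
w_gamma = rng.standard_normal(C); b_gamma = np.ones(C)   # identity scale at breath=0
w_beta = rng.standard_normal(C); b_beta = np.zeros(C)    # zero shift at breath=0
out = film_modulate(feats, breath, w_gamma, b_gamma, w_beta, b_beta)
print(out.shape)  # (4, 8)
```

With these biases, frames without breath (probability 0) pass through unchanged, while breath-bearing frames are selectively rescaled, which is the "selective amplification" behavior the abstract attributes to BreathFiLM.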

📝 Abstract
As deepfake audio becomes more realistic and diverse, developing generalizable countermeasure systems has become crucial. Existing detection methods primarily depend on XLS-R front-end features to improve generalization. Nonetheless, their performance remains limited, partly due to insufficient attention to fine-grained information such as physiological cues or frequency-domain features. In this paper, we propose BreathNet, a novel audio deepfake detection framework that integrates fine-grained breath information to improve generalization. Specifically, we design BreathFiLM, a feature-wise linear modulation mechanism that selectively amplifies temporal representations based on the presence of breathing sounds. BreathFiLM is trained jointly with the XLS-R extractor, in turn encouraging the extractor to learn and encode breath-related cues into the temporal features. We then use a frequency front-end to extract spectral features, which are fused with the temporal features to provide complementary information about artifacts introduced by vocoders or compression. Additionally, we propose a group of feature losses comprising a Positive-only Supervised Contrastive Loss (PSCL), center loss, and contrastive loss. These losses jointly enhance discriminative ability, encouraging the model to separate bona fide and deepfake samples more effectively in the feature space. Extensive experiments on five benchmark datasets demonstrate state-of-the-art (SOTA) performance. Using the ASVspoof 2019 LA training set, our method attains 1.99% average EER across four related evaluation benchmarks, with particularly strong performance on the In-the-Wild dataset, where it achieves 4.70% EER. Moreover, under the ASVspoof5 evaluation protocol, our method achieves an EER of 4.94% on this latest benchmark.
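The feature-loss group combines PSCL, center loss, and a contrastive loss. The sketch below is hypothetical: a minimal numpy illustration of center loss and one plausible reading of a positive-only supervised contrastive loss, where only bona fide samples act as anchors; the paper's exact PSCL formulation and weighting may differ.

```python
import numpy as np

def center_loss(emb, labels, centers):
    """Center loss: mean squared distance of each embedding to its class center."""
    d = emb - centers[labels]
    return float(np.mean(np.sum(d * d, axis=1)))

def positive_only_supcon(emb, labels, pos_label=0, tau=0.1):
    """Sketch of a positive-only supervised contrastive loss: only positive-class
    (bona fide) samples serve as anchors, pulled toward other positives and
    pushed away from everything else. Assumed formulation, not the paper's PSCL."""
    z = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # L2-normalize
    sim = z @ z.T / tau                                   # scaled cosine similarities
    n = len(labels)
    loss, count = 0.0, 0
    for i in range(n):
        if labels[i] != pos_label:
            continue                                      # anchors are positives only
        mask = np.arange(n) != i
        logits = sim[i, mask]
        pos = logits[labels[mask] == pos_label]
        if len(pos) == 0:
            continue
        log_denom = np.log(np.sum(np.exp(logits)))        # softmax denominator
        loss += np.mean(log_denom - pos)                  # -log p(positive pair)
        count += 1
    return float(loss / max(count, 1))
```

A tight bona fide cluster (positives near each other, far from spoofs) yields a lower contrastive loss than a spread-out one, which is the separability the abstract targets.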
Problem

Research questions and friction points this paper is trying to address.

audio deepfake detection
generalization
fine-grained cues
breath cues
temporal features
Innovation

Methods, ideas, or system contributions that make the work stand out.

Breath-cue-guided
Feature-wise Linear Modulation
Generalizable Deepfake Detection
Supervised Contrastive Learning
Multimodal Feature Fusion
Zhe Ye
Guangdong Key Laboratory of Information Security, School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510006, China
Xiangui Kang
Professor Xiangui Kang, Sun Yat-Sen University, China
multimedia signal processing, communication and game theory
Jiayi He
State Key Laboratory of Multi-modal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
Chengxin Chen
China Mobile Internet Corporation, China Mobile Communications Corporation, Guangzhou 510630, China
Wei Zhu
China Mobile Internet Corporation, China Mobile Communications Corporation, Guangzhou 510630, China
Kai Wu
China Mobile Internet Corporation, China Mobile Communications Corporation, Guangzhou 510630, China
Yin Yang
China Mobile Internet Corporation, China Mobile Communications Corporation, Guangzhou 510630, China
Jiwu Huang
Shenzhen MSU-BIT University
Multimedia forensics and security