🤖 AI Summary
This work addresses the limited generalization of current audio deepfake detection methods, which struggle to exploit fine-grained physiological cues against increasingly realistic synthetic speech. The authors propose BreathNet, a novel framework that explicitly incorporates breath sounds as a physiological prior. It introduces a BreathFiLM module that dynamically modulates temporal features extracted by XLS-R according to the presence or absence of breath, and fuses them with spectral features to capture vocoder artifacts. A multi-task feature learning strategy combines a positive-only supervised contrastive loss (PSCL), center loss, and a contrastive loss to improve the separability of genuine and spoofed samples in the feature space. The method achieves state-of-the-art performance across five benchmark datasets: an average EER of 1.99% over four evaluation sets under the ASVspoof 2019 LA protocol, 4.70% EER on the In-the-Wild dataset, and 4.94% EER under the latest ASVspoof5 protocol.
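The summary's core mechanism, feature-wise linear modulation (FiLM), applies a condition-dependent per-channel scale and shift to a feature map. A minimal NumPy sketch of that idea, conditioning on a per-frame breath-presence signal (all dimensions, weights, and the `breath` signal below are illustrative stand-ins, not the paper's actual BreathFiLM parameters):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: T frames, D feature channels.
T, D = 4, 8
x = rng.standard_normal((T, D))          # temporal features, e.g. from XLS-R

# Conditioning signal: per-frame breath-presence probability in [0, 1].
breath = rng.random((T, 1))

# FiLM generators: tiny linear maps from the condition to per-channel
# scale (gamma) and shift (beta). Random weights; biases chosen so that
# breath = 0 leaves the features unchanged (gamma = 1, beta = 0).
W_gamma, b_gamma = rng.standard_normal((1, D)), np.ones(D)
W_beta, b_beta = rng.standard_normal((1, D)), np.zeros(D)

gamma = breath @ W_gamma + b_gamma       # (T, D) per-frame, per-channel scale
beta = breath @ W_beta + b_beta          # (T, D) per-frame, per-channel shift

y = gamma * x + beta                     # feature-wise linear modulation
print(y.shape)                           # (4, 8)
```

In the real system the FiLM generators are learned jointly with the XLS-R extractor; here they are frozen random matrices purely to show the modulation arithmetic.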
📝 Abstract
As deepfake audio becomes more realistic and diverse, developing generalizable countermeasure systems has become crucial. Existing detection methods rely primarily on XLS-R front-end features to improve generalization. Nonetheless, their performance remains limited, partly because they pay insufficient attention to fine-grained information such as physiological cues or frequency-domain features. In this paper, we propose BreathNet, a novel audio deepfake detection framework that integrates fine-grained breath information to improve generalization. Specifically, we design BreathFiLM, a feature-wise linear modulation mechanism that selectively amplifies temporal representations based on the presence of breathing sounds. BreathFiLM is trained jointly with the XLS-R extractor, which in turn encourages the extractor to learn and encode breath-related cues in the temporal features. We then use a frequency front-end to extract spectral features, which are fused with the temporal features to capture complementary artifacts introduced by vocoders or compression. Additionally, we propose a group of feature losses comprising a Positive-only Supervised Contrastive Loss (PSCL), center loss, and contrastive loss. These losses jointly enhance discriminability, encouraging the model to separate bona fide and deepfake samples more effectively in the feature space. Extensive experiments on five benchmark datasets demonstrate state-of-the-art (SOTA) performance. Trained on the ASVspoof 2019 LA training set, our method attains a 1.99% average EER across four related evaluation benchmarks, with particularly strong performance on the In-the-Wild dataset, where it achieves 4.70% EER. Moreover, under the ASVspoof5 evaluation protocol, our method achieves an EER of 4.94% on this latest benchmark.
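Of the three feature losses named above, center loss has the simplest standard form: it penalizes the squared distance between each embedding and its class center, pulling same-class features together. A toy sketch of that standard formulation (the embeddings, labels, and centers below are made-up illustrations, not the paper's learned values, and the PSCL/contrastive terms are omitted):

```python
import numpy as np

def center_loss(features, labels, centers):
    """Half the mean squared distance between each embedding and its class center."""
    diffs = features - centers[labels]           # gather each sample's center
    return 0.5 * np.mean(np.sum(diffs ** 2, axis=1))

# Toy 2-D embeddings for two classes (0 = bona fide, 1 = spoof).
feats = np.array([[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]])
labels = np.array([0, 0, 1])
centers = np.array([[1.0, 0.0], [-1.0, 0.0]])    # one center per class

print(center_loss(feats, labels, centers))       # ≈ 0.00333
```

In practice the centers are trainable parameters updated alongside the network; the point here is only the distance-to-center computation that the loss minimizes.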