🤖 AI Summary
Audio deepfake detection models degrade when continuously exposed to novel attacks, and existing replay-based continual learning approaches still forget prior knowledge because their memory buffers lack sufficient sample diversity.
Method: We propose an auxiliary label-guided diversified replay mechanism. A label generation network produces auxiliary labels that enrich the semantic and acoustic diversity of the audio samples selected for the memory buffer, thereby mitigating feature bias and alleviating knowledge forgetting.
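The buffer-selection idea can be sketched as follows. This is a minimal, hypothetical greedy scheme (round-robin over auxiliary labels); the function name and selection details are illustrative assumptions, not the paper's actual RAIS implementation:

```python
import random
from collections import defaultdict

def diversified_buffer_update(samples, aux_labels, buffer_size):
    """Fill a rehearsal buffer by spreading the budget evenly across
    auxiliary labels, so no single acoustic cluster dominates.
    (Illustrative sketch; RAIS's real sampling strategy may differ.)"""
    # Group candidate samples by their network-generated auxiliary label.
    by_label = defaultdict(list)
    for sample, label in zip(samples, aux_labels):
        by_label[label].append(sample)
    # Shuffle each group so ties are broken randomly.
    pools = {label: random.sample(group, len(group))
             for label, group in by_label.items()}
    buffer = []
    # Round-robin over auxiliary labels until the budget is exhausted.
    while len(buffer) < buffer_size and any(pools.values()):
        for label in list(pools):
            if pools[label] and len(buffer) < buffer_size:
                buffer.append(pools[label].pop())
    return buffer
```

With two equally sized auxiliary-label groups and a budget of four, the buffer ends up holding two samples from each group, regardless of shuffle order.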
Contribution/Results: Evaluated on a five-stage incremental attack benchmark, our approach achieves a mean equal error rate (EER) of 1.953%, substantially outperforming state-of-the-art methods. It introduces the first auxiliary information–driven replay sampling paradigm for audio deepfake detection and releases open-source code.
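For reference, the EER metric reported above is the operating point where the false-acceptance and false-rejection rates are equal. A standard threshold-sweep computation (generic metric code, not the paper's evaluation script) looks like:

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER: the rate at the threshold where false-accept rate (spoof
    accepted as bona fide) equals false-reject rate (bona fide rejected).
    scores: higher = more bona fide-like; labels: 1 bona fide, 0 spoof."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    best_gap, eer = float("inf"), None
    for t in np.sort(np.unique(scores)):
        far = np.mean(scores[labels == 0] >= t)  # spoofs accepted
        frr = np.mean(scores[labels == 1] < t)   # bona fide rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```

On perfectly separated scores the EER is 0; the paper's reported 1.953% means the two error rates cross at roughly 0.02.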
📝 Abstract
The performance of existing audio deepfake detection frameworks degrades when confronted with new deepfake attacks. Rehearsal-based continual learning (CL), which updates models using a limited set of old data samples, helps preserve prior knowledge while incorporating new information. However, existing rehearsal techniques do not effectively capture the diversity of audio characteristics, introducing bias and increasing the risk of forgetting. To address this challenge, we propose Rehearsal with Auxiliary-Informed Sampling (RAIS), a rehearsal-based CL approach for audio deepfake detection. RAIS employs a label generation network to produce auxiliary labels, guiding diverse sample selection for the memory buffer. Extensive experiments show RAIS outperforms state-of-the-art methods, achieving an average Equal Error Rate (EER) of 1.953% across five experiences. The code is available at: https://github.com/falihgoz/RAIS.