🤖 AI Summary
Deepfake attacks generated by voice conversion and text-to-speech (TTS) synthesis pose serious threats to automatic speaker verification (ASV) systems. To address this, the authors propose an enhanced AASIST anti-spoofing architecture. The method keeps a frozen Wav2Vec 2.0 encoder to preserve robust self-supervised speech representations in limited-data settings, replaces the original graph attention module with a standardized multi-head attention block that uses heterogeneous query projections to improve feature discriminability, and substitutes the heuristic frame-segment fusion with a trainable, context-aware integration layer. Ablation studies validate the contribution of each component. Evaluated on the ASVspoof 5 corpus, the model achieves a 7.6% equal error rate (EER), outperforming a re-implemented AASIST baseline trained under the same conditions. The implementation is publicly available.
📝 Abstract
Advances in voice conversion and text-to-speech synthesis have made automatic speaker verification (ASV) systems more susceptible to spoofing attacks. This work explores modest refinements to the AASIST anti-spoofing architecture. It incorporates a frozen Wav2Vec 2.0 encoder to retain self-supervised speech representations in limited-data settings, substitutes the original graph attention block with a standardized multi-head attention module using heterogeneous query projections, and replaces heuristic frame-segment fusion with a trainable, context-aware integration layer. When evaluated on the ASVspoof 5 corpus, the proposed system reaches a 7.6% equal error rate (EER), improving on a re-implemented AASIST baseline under the same training conditions. Ablation experiments suggest that each architectural change contributes to the overall performance, indicating that targeted adjustments to established models may help strengthen speech deepfake detection in practical scenarios. The code is publicly available at https://github.com/KORALLLL/AASIST_SCALING.
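The abstract does not spell out how the heterogeneous query projections differ from standard multi-head attention, so the following is only an illustrative sketch of one plausible reading: each head gets its own independent query projection while keys and values are shared across heads. All names and shapes here are hypothetical, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def heterogeneous_mha(X, Wq_heads, Wk, Wv, Wo):
    """Sketch of multi-head attention with heterogeneous queries:
    each head applies its own query projection (Wq_heads[h]), while
    the key/value projections (Wk, Wv) are shared by all heads."""
    dk = Wk.shape[1]
    K = X @ Wk                                       # (T, dk) shared keys
    V = X @ Wv                                       # (T, dk) shared values
    heads = []
    for Wq in Wq_heads:                              # distinct query map per head
        Q = X @ Wq                                   # (T, dk)
        A = softmax(Q @ K.T / np.sqrt(dk), axis=-1)  # (T, T) attention weights
        heads.append(A @ V)                          # (T, dk)
    return np.concatenate(heads, axis=-1) @ Wo       # (T, d) output projection

rng = np.random.default_rng(0)
T, d, H, dk = 6, 16, 4, 8
X = rng.standard_normal((T, d))
Wq_heads = [rng.standard_normal((d, dk)) * 0.1 for _ in range(H)]
Wk = rng.standard_normal((d, dk)) * 0.1
Wv = rng.standard_normal((d, dk)) * 0.1
Wo = rng.standard_normal((H * dk, d)) * 0.1
out = heterogeneous_mha(X, Wq_heads, Wk, Wv, Wo)
print(out.shape)  # (6, 16)
```

In this reading, sharing keys and values keeps the parameter overhead small while letting each head attend from a different query subspace; the paper's actual design may differ in detail.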
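The "trainable, context-aware integration layer" that replaces heuristic frame-segment fusion is likewise unspecified in the abstract. A minimal sketch, under the assumption that it is a learned sigmoid gate blending frame-level and segment-level embeddings (instead of a fixed max/mean rule); every name and shape here is an illustrative guess, not the paper's design.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_frame_segment_fusion(frames, segment, Wg, bg):
    """Hypothetical trainable fusion: a per-dimension sigmoid gate,
    computed from each frame embedding concatenated with the segment
    embedding, convexly blends the two streams."""
    T, d = frames.shape
    seg = np.broadcast_to(segment, (T, d))       # tile segment vector per frame
    ctx = np.concatenate([frames, seg], axis=-1) # (T, 2d) fusion context
    g = sigmoid(ctx @ Wg + bg)                   # (T, d) gate in (0, 1)
    return g * frames + (1.0 - g) * seg          # (T, d) fused features

rng = np.random.default_rng(1)
frames = rng.standard_normal((5, 4))             # frame-level embeddings
segment = rng.standard_normal(4)                 # segment-level embedding
Wg = rng.standard_normal((8, 4)) * 0.1           # trainable gate weights
bg = np.zeros(4)
fused = gated_frame_segment_fusion(frames, segment, Wg, bg)
print(fused.shape)  # (5, 4)
```

Because the gate is trained end to end, the network can learn where frame detail versus segment context matters, which is the stated motivation for replacing the heuristic fusion.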