Towards Scalable AASIST: Refining Graph Attention for Speech Deepfake Detection

📅 2025-07-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deepfake attacks generated by voice conversion and TTS synthesis pose serious threats to automatic speaker verification (ASV) systems. To address this, we propose an enhanced AASIST anti-spoofing architecture. Our method retains a frozen Wav2Vec 2.0 encoder to preserve robust self-supervised speech representations, replaces the original graph attention module with normalized multi-head attention using heterogeneous query projections, and introduces a trainable, context-aware frame-segment fusion layer to improve modeling in low-resource scenarios. Ablation studies validate the effectiveness of each component. Evaluated on the ASVspoof 5 corpus, the model achieves a 7.6% equal error rate (EER), outperforming a re-implemented AASIST baseline under the same training conditions. The implementation is publicly available.
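The paper does not spell out the attention design here, but one plausible reading of "normalized multi-head attention with heterogeneous query projections" is that inputs are layer-normalized and each head owns its own query projection while key/value projections are shared. The sketch below illustrates that reading only; the names `hetero_mha`, `Wq_heads`, `Wk`, and `Wv` are illustrative and not taken from the paper's code.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def hetero_mha(x, Wq_heads, Wk, Wv):
    """Multi-head attention in which every head has a distinct
    ("heterogeneous") query projection, while keys and values are
    shared across heads. An assumed reading of the paper's module,
    not its exact implementation."""
    x = layer_norm(x)                      # "normalized" attention input
    K = x @ Wk                             # shared keys,   shape (T, d)
    V = x @ Wv                             # shared values, shape (T, d)
    d = K.shape[-1]
    heads = []
    for Wq in Wq_heads:                    # one query projection per head
        Q = x @ Wq                         # head-specific queries (T, d)
        A = softmax(Q @ K.T / np.sqrt(d))  # scaled dot-product weights
        heads.append(A @ V)
    return np.concatenate(heads, axis=-1)  # (T, n_heads * d)

# Toy usage on 4 graph nodes with 8-dim features and 2 heads
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
Wq_heads = [rng.standard_normal((8, 8)) for _ in range(2)]
Wk = rng.standard_normal((8, 8))
Wv = rng.standard_normal((8, 8))
out = hetero_mha(x, Wq_heads, Wk, Wv)      # shape (4, 16)
```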

📝 Abstract
Advances in voice conversion and text-to-speech synthesis have made automatic speaker verification (ASV) systems more susceptible to spoofing attacks. This work explores modest refinements to the AASIST anti-spoofing architecture. It incorporates a frozen Wav2Vec 2.0 encoder to retain self-supervised speech representations in limited-data settings, substitutes the original graph attention block with a standardized multi-head attention module using heterogeneous query projections, and replaces heuristic frame-segment fusion with a trainable, context-aware integration layer. When evaluated on the ASVspoof 5 corpus, the proposed system reaches a 7.6% equal error rate (EER), improving on a re-implemented AASIST baseline under the same training conditions. Ablation experiments suggest that each architectural change contributes to the overall performance, indicating that targeted adjustments to established models may help strengthen speech deepfake detection in practical scenarios. The code is publicly available at https://github.com/KORALLLL/AASIST_SCALING.
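The reported 7.6% EER is the operating point where the false-acceptance rate on spoofed trials equals the false-rejection rate on bona fide trials. A minimal pure-Python sketch of that metric (the helper `compute_eer` is illustrative, not from the released code):

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Equal error rate: sweep thresholds over the observed scores
    and return the rate at the point where the false-acceptance
    rate (FAR) and false-rejection rate (FRR) are closest."""
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = None, None
    for t in thresholds:
        # FRR: bona fide trials scored below the threshold
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # FAR: spoofed trials scored at or above the threshold
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

# Toy example: higher score = more likely bona fide
eer = compute_eer([0.9, 0.8, 0.7, 0.4], [0.1, 0.2, 0.3, 0.6])  # -> 0.25
```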
Problem

Research questions and friction points this paper is trying to address.

Enhancing AASIST for better speech deepfake detection
Improving spoofing attack resistance in ASV systems
Optimizing graph attention in limited-data settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Frozen Wav2Vec 2.0 encoder for self-supervised speech
Standardized multi-head attention with heterogeneous queries
Trainable context-aware fusion layer replaces heuristics
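The abstract describes replacing heuristic frame-segment fusion with a trainable, context-aware integration layer but gives no formula. One common design that fits that description is a learned sigmoid gate computed from the joint context, mixing the two streams per dimension. The sketch below assumes that gated form; `context_fusion`, `Wg`, and `bg` are hypothetical names, not the paper's.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def context_fusion(frame_emb, segment_emb, Wg, bg):
    """Trainable, context-aware fusion of a frame-level and a
    segment-level embedding: a gate computed from both streams
    decides, per dimension, how much each contributes. A sketch
    of one plausible design; the paper's layer may differ."""
    ctx = np.concatenate([frame_emb, segment_emb])  # joint context
    g = sigmoid(ctx @ Wg + bg)                      # gate values in (0, 1)
    return g * frame_emb + (1.0 - g) * segment_emb  # convex mix per dim

# Toy usage with 6-dim embeddings
rng = np.random.default_rng(1)
d = 6
f = rng.standard_normal(d)
s = rng.standard_normal(d)
Wg = rng.standard_normal((2 * d, d))
bg = np.zeros(d)
fused = context_fusion(f, s, Wg, bg)  # shape (6,)
```

Because the gate lies in (0, 1), each fused dimension is a convex combination of the corresponding frame and segment values, so the layer can smoothly interpolate between the two streams during training.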
Ivan Viakhirev
Speechka.ai, Information Technologies, Mechanics and Optics University, St. Petersburg, Russian Federation
Daniil Sirota
St. Petersburg State University, St. Petersburg, Russian Federation
Aleksandr Smirnov
Speechka.ai, Information Technologies, Mechanics and Optics University, St. Petersburg, Russian Federation
Kirill Borodin
MTUCI
deep learning for audio, gen AI, safe AI