🤖 AI Summary
This work addresses the poor interpretability and black-box decision-making prevalent in deepfake audio detection by proposing a multi-task transformer that jointly discriminates genuine from spoofed speech while predicting formant trajectories and voicing patterns over time. Building on a prior speaker-formant transformer, the model incorporates an intrinsic interpretability mechanism, an improved input segmentation strategy, and a redesigned decoding process, using attention visualization to reveal whether decisions rely more on voiced or unvoiced regions. The resulting model has a reduced parameter count and shorter training time than the baseline, offering a more interpretable and efficient solution for deepfake audio forensics without compromising detection accuracy.
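To make the multi-task setup concrete, here is a minimal sketch of one plausible architecture: a shared transformer encoder with frame-level heads for formant trajectories and voicing, plus an utterance-level real/fake classifier. This is an illustrative assumption, not the authors' implementation; all names, layer sizes, and head designs are hypothetical.

```python
# Hypothetical sketch of the multi-task architecture described above;
# dimensions and head designs are assumptions, not the paper's actual model.
import torch
import torch.nn as nn

class MultiTaskDeepfakeDetector(nn.Module):
    def __init__(self, n_feats=80, d_model=256, n_heads=4, n_layers=4, n_formants=3):
        super().__init__()
        self.embed = nn.Linear(n_feats, d_model)  # project acoustic frames
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # Frame-level auxiliary heads: formant trajectories and voicing.
        self.formant_head = nn.Linear(d_model, n_formants)  # formant values per frame
        self.voicing_head = nn.Linear(d_model, 1)           # voiced probability per frame
        # Utterance-level head: real vs. fake, from mean-pooled frame states.
        self.classifier = nn.Linear(d_model, 2)

    def forward(self, x):                          # x: (batch, frames, n_feats)
        h = self.encoder(self.embed(x))            # (batch, frames, d_model)
        formants = self.formant_head(h)            # (batch, frames, n_formants)
        voicing = torch.sigmoid(self.voicing_head(h)).squeeze(-1)  # (batch, frames)
        logits = self.classifier(h.mean(dim=1))    # (batch, 2) real/fake scores
        return logits, formants, voicing

# Usage: logits, formants, voicing = MultiTaskDeepfakeDetector()(torch.randn(2, 200, 80))
```

The key design point is the shared encoder: the auxiliary formant and voicing objectives shape the same representations the real/fake classifier reads from, which is what ties the detection decision to phonetically meaningful quantities.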
📝 Abstract
In this work, we introduce a multi-task transformer for speech deepfake detection, capable of predicting formant trajectories and voicing patterns over time, ultimately classifying speech as real or fake, and highlighting whether its decisions rely more on voiced or unvoiced regions. Building on a prior speaker-formant transformer architecture, we streamline the model with an improved input segmentation strategy, redesign the decoding process, and integrate built-in explainability. Compared to the baseline, our model requires fewer parameters, trains faster, and provides better interpretability, without sacrificing prediction performance.
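The claim that the model can highlight whether decisions rely more on voiced or unvoiced regions suggests a simple attribution rule over attention weights. The sketch below, a hedged assumption rather than the paper's actual mechanism, shows one way to compute this given per-frame attention scores and a binary voicing mask (both hypothetical inputs).

```python
# Hypothetical voiced/unvoiced attribution; the paper's exact rule may differ.
import torch

def voiced_vs_unvoiced_attribution(attn: torch.Tensor, voiced: torch.Tensor):
    """Split normalized attention mass between voiced and unvoiced frames.

    attn:   (frames,) nonnegative attention scores for one utterance
    voiced: (frames,) binary mask, 1 where the frame is voiced
    """
    attn = attn / attn.sum()                        # normalize to a distribution
    voiced_mass = attn[voiced.bool()].sum().item()  # attention on voiced frames
    return voiced_mass, 1.0 - voiced_mass

# Example: a voiced_mass above 0.5 would indicate the decision
# leaned more on voiced regions than unvoiced ones.
```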