Multi-Task Transformer for Explainable Speech Deepfake Detection via Formant Modeling

📅 2026-01-21
🤖 AI Summary
This work addresses the poor interpretability of black-box speech deepfake detectors by proposing a multi-task Transformer that jointly discriminates genuine from spoofed speech while predicting formant trajectories and voicing patterns over time. By building in an intrinsic interpretability mechanism, improving the input segmentation strategy, and redesigning the decoding process, the model maintains detection performance with fewer parameters and a shorter training time. Built upon a prior speaker-formant transformer, the approach combines temporal formant modeling, multi-task learning, and attention visualization to improve model transparency. Compared to the baseline, it is more interpretable and more efficient without sacrificing detection accuracy.

📝 Abstract
In this work, we introduce a multi-task transformer for speech deepfake detection, capable of predicting formant trajectories and voicing patterns over time, ultimately classifying speech as real or fake, and highlighting whether its decisions rely more on voiced or unvoiced regions. Building on a prior speaker-formant transformer architecture, we streamline the model with an improved input segmentation strategy, redesign the decoding process, and integrate built-in explainability. Compared to the baseline, our model requires fewer parameters, trains faster, and provides better interpretability, without sacrificing prediction performance.
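The abstract describes three jointly learned outputs: per-frame formant trajectories, per-frame voicing, and an utterance-level real/fake decision. One common way to train such a model is a weighted sum of per-task losses. The sketch below illustrates that idea only; the choice of mean squared error and binary cross-entropy, the weights `w_formant`, `w_voicing`, `w_cls`, and all function names are assumptions for illustration, not the authors' stated objective.

```python
import math

def mse(pred, target):
    """Mean squared error over per-frame values (e.g. normalized formant frequencies)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def bce(prob, label, eps=1e-7):
    """Binary cross-entropy for one predicted probability and one 0/1 label."""
    p = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

def multitask_loss(formant_pred, formant_true,
                   voicing_prob, voicing_true,
                   fake_prob, fake_label,
                   w_formant=1.0, w_voicing=1.0, w_cls=1.0):
    """Hypothetical joint objective: weighted sum of the three task losses."""
    l_formant = mse(formant_pred, formant_true)               # formant regression
    l_voicing = sum(bce(p, y) for p, y in zip(voicing_prob, voicing_true)) / len(voicing_prob)
    l_cls = bce(fake_prob, fake_label)                        # real vs. fake
    return w_formant * l_formant + w_voicing * l_voicing + w_cls * l_cls
```

In a setup like this, the task weights would typically be tuned so that the auxiliary formant and voicing tasks support the classification task rather than dominate it.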
Problem

Research questions and friction points this paper is trying to address.

speech deepfake detection
explainability
formant modeling
multi-task learning
voice authenticity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Task Transformer
Formant Modeling
Explainable AI
Speech Deepfake Detection
Voicing Pattern
Viola Negroni
Politecnico di Milano
Multimedia Forensics, Audio Signal Processing, Deepfake Detection
L. Cuccovillo
Fraunhofer Institute for Digital Media Technology IDMT - Ilmenau, Germany
P. Bestagini
Dipartimento di Elettronica, Informazione e Bioingegneria - Politecnico di Milano - Milan, Italy
P. Aichroth
Fraunhofer Institute for Digital Media Technology IDMT - Ilmenau, Germany
Stefano Tubaro
Politecnico di Milano, DEI