🤖 AI Summary
This study addresses the detection of synthetic replacement words (deepfake words) in speech by fine-tuning the Whisper model. Leveraging Whisper’s next-token prediction mechanism during transcription, the method flags anomalous synthetic tokens, and it uses partially vocoded speech data for augmentation, substantially reducing annotation costs. This work is the first to apply Whisper’s next-token prediction capability to deepfake word detection, simultaneously achieving low transcription and detection error rates on in-domain data. On out-of-domain data, the approach remains on par with a dedicated ResNet-based detection model, demonstrating strong practical potential, although room remains for improving generalization.
📝 Abstract
Deepfake speech utterances can be forged by replacing one or more words in a bona fide utterance with semantically different words synthesized by speech generative models. While a dedicated synthetic word detector could be developed, we investigate a cost-effective method that fine-tunes a pre-trained Whisper model to detect synthetic words while transcribing the input utterance via next-token prediction. We further investigate using partially vocoded utterances as the fine-tuning data, thereby reducing the cost of data collection. Our experiments demonstrate that, on in-domain test data, the fine-tuned Whisper yields low synthetic-word detection error rates and transcription error rates. On out-of-domain test data with synthetic words produced by unseen speech generative models, the fine-tuned Whisper remains on par with a dedicated ResNet-based detection model; however, the overall performance degradation calls for strategies to improve its generalization capability.
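To make the idea of detecting synthetic words "while transcribing via next-token prediction" concrete, one plausible formulation is to fold word-level labels into the transcription target itself, so the decoder learns to emit marker tokens around replaced words. The sketch below is purely illustrative: the marker tokens (`<fake>`, `</fake>`) and helper functions are hypothetical assumptions, not details taken from the paper.

```python
# Hypothetical sketch: encode synthetic-word labels into a Whisper-style
# transcription target using marker tokens, so that next-token prediction
# can both transcribe and flag deepfake words. Marker names are invented.

FAKE_OPEN, FAKE_CLOSE = "<fake>", "</fake>"

def encode_target(words, is_synthetic):
    """Wrap each contiguous run of synthetic words in marker tokens.

    words        : list of transcript words
    is_synthetic : parallel list of bools (True = replaced/vocoded word)
    """
    out, inside = [], False
    for w, syn in zip(words, is_synthetic):
        if syn and not inside:
            out.append(FAKE_OPEN)
            inside = True
        elif not syn and inside:
            out.append(FAKE_CLOSE)
            inside = False
        out.append(w)
    if inside:  # close a run that ends at the utterance boundary
        out.append(FAKE_CLOSE)
    return " ".join(out)

def decode_detections(target):
    """Recover (word, is_synthetic) pairs from a marked-up transcript."""
    pairs, inside = [], False
    for tok in target.split():
        if tok == FAKE_OPEN:
            inside = True
        elif tok == FAKE_CLOSE:
            inside = False
        else:
            pairs.append((tok, inside))
    return pairs

words = ["open", "the", "door", "now"]
labels = [False, False, True, False]
target = encode_target(words, labels)
# target == "open the <fake> door </fake> now"
```

Under such a scheme, fine-tuning Whisper on marked-up targets would let a single decoding pass produce both the transcript and the detection labels; decoding the markers back out recovers per-word decisions for scoring detection error rates.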