🤖 AI Summary
This study addresses the challenge of localizing fine-grained, multi-region speech forgeries, such as word-level replacements, when the number and locations of manipulated segments are unknown. The authors introduce MIST, a multilingual dataset spanning six languages and built with LLM-guided semantic replacement and neural voice cloning, and propose ISA (Iterative Segment Analysis), a backbone-agnostic framework that applies coarse-to-fine sliding-window classification with gap-tolerant region proposal and boundary refinement to recover tampered regions without prior knowledge of their count. They also define SF1@τ, a segment-level F1 metric based on temporal IoU matching that jointly accounts for the number of forged regions and boundary precision. Zero-shot experiments show that utterance-level detectors trained on fully synthesized speech assign near-zero fake probability to MIST utterances, in which only 2–7% of the content is altered, whereas ISA consistently outperforms non-iterative baselines.
📝 Abstract
Recent advances in voice cloning and text-to-speech synthesis have made partial speech manipulation - where an adversary replaces a few words within an utterance to alter its meaning while preserving the speaker's identity - an increasingly realistic threat. Existing audio deepfake detection benchmarks focus on utterance-level binary classification or single-region tampering, leaving a critical gap in detecting and localizing multiple inpainted segments whose count is unknown a priori. We address this gap with three contributions. First, we introduce MIST (Multiregion Inpainting Speech Tampering), a large-scale multilingual dataset spanning 6 languages with 1-3 independently inpainted word-level segments per utterance, generated via LLM-guided semantic replacement and neural voice cloning, with fake content constituting only 2-7% of each utterance. Second, we propose ISA (Iterative Segment Analysis), a backbone-agnostic framework that performs coarse-to-fine sliding-window classification with gap-tolerant region proposal and boundary refinement to recover all tampered regions without prior knowledge of their count. Third, we define SF1@τ, a segment-level F1 metric based on temporal IoU matching that jointly evaluates region count accuracy and localization precision. Zero-shot evaluation reveals that partial inpainting at word granularity remains unsolved by existing deepfake detectors: utterance-level classifiers trained on fully synthesized speech assign near-zero fake probability to MIST utterances, in which only 2-7% of the content is manipulated. ISA consistently outperforms non-iterative baselines in this challenging setting, and the dataset, code, and evaluation toolkit are publicly released.
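To make the idea behind the segment-level metric (SF1@τ, written SF1@tau in some renderings) concrete, the following is a minimal sketch of an F1 score computed over temporal IoU matches. The abstract does not specify the matching procedure, so the greedy one-to-one matching, the function names `temporal_iou` and `segment_f1`, the default threshold of 0.5, and the handling of the empty case are all assumptions made for illustration, not the released evaluation toolkit.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec); hypothetical representation


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def segment_f1(pred: List[Segment], gold: List[Segment], tau: float = 0.5) -> float:
    """Segment-level F1 at IoU threshold tau.

    Assumption: predictions and references are matched one-to-one,
    greedily, in order of decreasing IoU.
    """
    if not pred and not gold:
        return 1.0  # assumption: nothing to find and nothing predicted

    # Rank every (prediction, reference) pair by IoU, best first.
    pairs = sorted(
        (
            (temporal_iou(p, g), i, j)
            for i, p in enumerate(pred)
            for j, g in enumerate(gold)
        ),
        reverse=True,
    )

    used_pred, used_gold = set(), set()
    for iou, i, j in pairs:
        if iou < tau:
            break  # remaining pairs overlap too little to count
        if i in used_pred or j in used_gold:
            continue  # each segment may be matched at most once
        used_pred.add(i)
        used_gold.add(j)

    true_positives = len(used_pred)
    if true_positives == 0:
        return 0.0

    # Spurious predictions lower precision; missed regions lower recall,
    # so a wrong region count is penalized alongside poor boundaries.
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)
```

As a worked case under these assumptions, take references `[(1.0, 1.4), (3.0, 3.5)]` and predictions `[(1.05, 1.45), (2.0, 2.2), (3.3, 3.9)]`. At τ = 0.5 only the first prediction matches (IoU about 0.78), giving precision 1/3, recall 1/2, and an F1 of 0.4. Lowering τ to 0.2 also admits the third prediction (IoU about 0.22), which raises F1 to 0.8.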