Toward Fine-Grained Speech Inpainting Forensics: A Dataset, Method, and Metric for Multi-Region Tampering Localization

📅 2026-05-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the localization of fine-grained, multi-region speech forgeries, such as word-level replacements, when the number and positions of the manipulated segments are unknown. The authors make three contributions: MIST, a multilingual dataset covering 6 languages, built with LLM-guided semantic replacement and neural voice cloning, with 1-3 inpainted word-level segments per utterance; ISA (Iterative Segment Analysis), a backbone-agnostic framework that combines coarse-to-fine sliding-window classification with gap-tolerant region proposal and boundary refinement; and SF1@τ, a segment-level F1 metric based on temporal IoU matching that jointly scores region count accuracy and boundary precision. In zero-shot evaluation, utterance-level detectors trained on fully synthesized speech assign near-zero fake probability to MIST utterances, in which only 2-7% of the content is manipulated, whereas ISA consistently outperforms non-iterative baselines.
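To make the idea of a segment-level F1 based on temporal IoU matching concrete, here is a minimal sketch. It is not the authors' evaluation toolkit: the function names `temporal_iou` and `segment_f1`, the greedy one-to-one matching rule, and the default threshold of 0.5 are all assumptions chosen for illustration.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec); hypothetical representation


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def segment_f1(pred: List[Segment], truth: List[Segment], tau: float = 0.5) -> float:
    """Segment-level F1: a prediction is a true positive if it is matched
    one-to-one to a ground-truth segment with IoU >= tau.
    Greedy matching (highest IoU first) is an assumption of this sketch."""
    if not pred and not truth:
        return 1.0  # nothing to find, nothing predicted
    pairs = sorted(
        ((temporal_iou(p, t), i, j)
         for i, p in enumerate(pred)
         for j, t in enumerate(truth)),
        reverse=True,
    )
    used_pred, used_truth = set(), set()
    tp = 0
    for iou, i, j in pairs:
        if iou < tau:
            break  # remaining pairs overlap too little to count
        if i in used_pred or j in used_truth:
            continue
        used_pred.add(i)
        used_truth.add(j)
        tp += 1
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(truth) if truth else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because unmatched predictions lower precision and unmatched ground-truth segments lower recall, a score of this shape penalizes both a wrong region count and loose boundaries, which is the joint behavior the summary attributes to SF1@τ.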

📝 Abstract

Recent advances in voice cloning and text-to-speech synthesis have made partial speech manipulation, where an adversary replaces a few words within an utterance to alter its meaning while preserving the speaker's identity, an increasingly realistic threat. Existing audio deepfake detection benchmarks focus on utterance-level binary classification or single-region tampering, leaving a critical gap in detecting and localizing multiple inpainted segments whose count is unknown a priori. We address this gap with three contributions. First, we introduce MIST (Multiregion Inpainting Speech Tampering), a large-scale multilingual dataset spanning 6 languages with 1-3 independently inpainted word-level segments per utterance, generated via LLM-guided semantic replacement and neural voice cloning, with fake content constituting only 2-7% of each utterance. Second, we propose ISA (Iterative Segment Analysis), a backbone-agnostic framework that performs coarse-to-fine sliding-window classification with gap-tolerant region proposal and boundary refinement to recover all tampered regions without prior knowledge of their count. Third, we define SF1@τ, a segment-level F1 metric based on temporal IoU matching that jointly evaluates region count accuracy and localization precision. Zero-shot evaluation reveals that partial inpainting at word granularity remains unsolved by existing deepfake detectors: utterance-level classifiers trained on fully synthesized speech assign near-zero fake probability to MIST utterances where only 2-7% of content is manipulated. ISA consistently outperforms non-iterative baselines in this challenging setting, and the dataset, code, and evaluation toolkit are publicly released.
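The gap-tolerant region proposal step mentioned in the abstract can be illustrated with a short sketch. This is not the released ISA code: it assumes a hypothetical backbone has already produced one fake probability per sliding window, and the function name `propose_regions` and the parameters `hop`, `win`, `threshold`, and `max_gap` are illustrative. The coarse-to-fine iteration and boundary refinement stages are omitted.

```python
from typing import List, Tuple


def propose_regions(
    scores: List[float],     # per-window fake probability (assumed given)
    hop: float,              # seconds between window starts (assumption)
    win: float,              # window length in seconds (assumption)
    threshold: float = 0.5,  # hypothetical decision threshold
    max_gap: int = 1,        # below-threshold windows that may be bridged
) -> List[Tuple[float, float]]:
    """Merge flagged windows into (start_sec, end_sec) regions, tolerating
    up to `max_gap` consecutive unflagged windows inside a region."""
    regions: List[Tuple[float, float]] = []
    start = None  # index of the first flagged window in the open region
    last = None   # index of the most recent flagged window
    for i, score in enumerate(scores):
        if score < threshold:
            continue
        if start is None:
            start = i
        elif i - last - 1 > max_gap:
            # Gap too wide: close the open region and begin a new one.
            regions.append((start * hop, last * hop + win))
            start = i
        last = i
    if start is not None:
        regions.append((start * hop, last * hop + win))
    return regions
```

With `max_gap=1`, two flagged windows separated by a single low-scoring window are merged into one region, while a longer run of low scores splits them. The number of returned regions is therefore determined by the scores themselves and does not have to be specified in advance.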

Problem

Research questions and friction points this paper is trying to address.

speech inpainting

audio deepfake

tampering localization

fine-grained forensics

multiregion tampering

Innovation

Methods, ideas, or system contributions that make the work stand out.

speech inpainting forensics

multi-region tampering localization

iterative segment analysis