🤖 AI Summary
State-of-the-art voice editing models (e.g., Voicebox) generate highly realistic manipulated speech that human listeners struggle to detect, rendering existing anti-spoofing detectors ineffective. Method: We introduce SINE, the first benchmark specifically designed for detecting seamless voice editing. It is built with Voicebox to produce high-fidelity, fine-grained tampered samples. We also propose a context-aware evaluation framework that jointly assesses detection and localization and tests generalization across editing methods. Our approach integrates self-supervised representations (Wav2Vec 2.0 and HuBERT) with multi-task modeling. Contribution/Results: The proposed self-supervised detector achieves >92% AUC on SINE with a mean editing-boundary localization error of <30 ms, substantially outperforming supervised baselines, and remains robust in cross-model transfer. This work fills two gaps: the lack of high-quality evaluation data for seamless editing and the lack of a principled assessment paradigm.
📝 Abstract
Advances in neural speech editing have raised concerns about their misuse in spoofing attacks. Traditional corpora of partially edited speech focus primarily on cut-and-paste edits, which maintain speaker consistency but often introduce detectable discontinuities. Recent methods such as A³T and Voicebox improve transitions by leveraging contextual information. To foster spoofing detection research, we introduce the Speech INfilling Edit (SINE) dataset, created with Voicebox. We detail the process of re-implementing Voicebox training and creating the dataset. Subjective evaluations confirm that speech edited with this novel technique is more challenging to detect than speech edited with conventional cut-and-paste methods. Although humans find these edits difficult to detect, experimental results demonstrate that detectors based on self-supervised representations can achieve remarkable performance in detection, localization, and generalization across different edit methods. The dataset and related models will be made publicly available.
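To make the localization task concrete, the sketch below shows one common way to frame it: edited time spans are turned into per-frame binary labels, and per-frame predictions are merged back into time spans. This is an illustrative sketch, not the authors' pipeline. The function names and the 20 ms frame hop are assumptions (20 ms matches the output frame rate of wav2vec 2.0-style encoders on 16 kHz audio); the abstract does not specify the label resolution used for SINE.

```python
from typing import List, Tuple

FRAME_HOP_S = 0.02  # assumption: 20 ms hop, as in wav2vec 2.0-style encoders


def segments_to_frame_labels(
    edited: List[Tuple[float, float]],
    duration_s: float,
    hop_s: float = FRAME_HOP_S,
) -> List[int]:
    """Label a frame 1 if its centre lies inside an edited span, else 0."""
    n_frames = int(round(duration_s / hop_s))
    labels = []
    for i in range(n_frames):
        centre = (i + 0.5) * hop_s
        inside = any(start <= centre < end for start, end in edited)
        labels.append(int(inside))
    return labels


def frame_labels_to_segments(
    labels: List[int], hop_s: float = FRAME_HOP_S
) -> List[Tuple[float, float]]:
    """Merge consecutive frames labelled 1 into (start, end) spans in seconds."""
    segments = []
    run_start = None
    for i, flag in enumerate(labels):
        if flag and run_start is None:
            run_start = i  # a new edited run begins
        elif not flag and run_start is not None:
            segments.append((run_start * hop_s, i * hop_s))
            run_start = None
    if run_start is not None:  # run extends to the end of the utterance
        segments.append((run_start * hop_s, len(labels) * hop_s))
    return segments


if __name__ == "__main__":
    # Hypothetical 0.4 s utterance with one infilled span from 0.1 s to 0.2 s.
    labels = segments_to_frame_labels([(0.1, 0.2)], duration_s=0.4)
    print(labels)
    print(frame_labels_to_segments(labels))
```

With labels at this resolution, a boundary can only be recovered to within one frame hop, so the hop size sets a floor on any boundary localization error measured this way.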