Unifying Speech Editing Detection and Content Localization via Prior-Enhanced Audio LLMs

📅 2026-01-29
🤖 AI Summary
This work addresses two problems: detecting seamless speech forgeries produced by end-to-end neural speech editing systems, and the scarcity of high-quality bilingual datasets for the task. The authors construct AiEdit, a large-scale bilingual speech editing dataset, and propose PELM, described as the first large-model framework to unify speech editing detection and content localization. PELM formulates both tasks as audio question answering. It adds word-level probability priors and a centroid-aggregation-based acoustic consistency loss to mitigate the forgery bias and semantic-priority bias of existing audio large models. The authors report that PELM significantly outperforms state-of-the-art methods on both HumanEdit and AiEdit, achieving equal error rates (EER) of 0.57% and 9.28% (localization), respectively.

📝 Abstract
Speech editing achieves semantic inversion by performing fine-grained segment-level manipulation on original utterances, while preserving global perceptual naturalness. Existing detection studies mainly focus on manually edited speech with explicit splicing artifacts, and therefore struggle to cope with emerging end-to-end neural speech editing techniques that generate seamless acoustic transitions. To address this challenge, we first construct a large-scale bilingual dataset, AiEdit, which leverages large language models to drive precise semantic tampering logic and employs multiple advanced neural speech editing methods for data synthesis, thereby filling the gap of high-quality speech editing datasets. Building upon this foundation, we propose PELM (Prior-Enhanced Audio Large Language Model), the first large-model framework that unifies speech editing detection and content localization by formulating them as an audio question answering task. To mitigate the inherent forgery bias and semantic-priority bias observed in existing audio large models, PELM incorporates word-level probability priors to provide explicit acoustic cues, and further designs a centroid-aggregation-based acoustic consistency perception loss to explicitly enforce the modeling of subtle local distribution anomalies. Extensive experimental results demonstrate that PELM significantly outperforms state-of-the-art methods on both the HumanEdit and AiEdit datasets, achieving equal error rates (EER) of 0.57% and 9.28% (localization), respectively.
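
The abstract reports results as equal error rates. As a reminder of what that metric measures, here is a minimal sketch of a generic EER computation: the operating point where the false positive rate (genuine items flagged as edited) equals the false negative rate (edited items missed). This is not the authors' evaluation code. The function name `equal_error_rate`, the label convention (1 = edited, 0 = genuine), and the assumption that higher scores mean "more likely edited" are illustrative choices, and the granularity at which the paper scores localization is not stated on this page.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER by scanning every observed score as a threshold.

    Assumed convention (hypothetical, not from the paper):
    labels are 1 for edited and 0 for genuine, and a higher score
    means the item is more likely to be edited.
    Returns (eer, threshold).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one edited and one genuine example")

    best = None  # (gap, eer, threshold)
    for thr in sorted(set(scores)):
        # Genuine items scored at or above the threshold are false positives.
        fpr = sum(s >= thr for s in negatives) / len(negatives)
        # Edited items scored below the threshold are false negatives.
        fnr = sum(s < thr for s in positives) / len(positives)
        gap = abs(fpr - fnr)
        if best is None or gap < best[0]:
            # With finite data the two rates rarely match exactly,
            # so report their mean at the closest crossing.
            best = (gap, (fpr + fnr) / 2, thr)
    return best[1], best[2]


if __name__ == "__main__":
    eer, thr = equal_error_rate([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
    print(f"EER = {eer:.2%} at threshold {thr}")
```

The same function can be applied to utterance-level scores for detection or to finer-grained scores (for example per word or per frame) for localization, depending on how the items are defined.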
Problem

Research questions and friction points this paper is trying to address.

speech editing detection
content localization
neural speech editing
audio forgery
seamless acoustic transitions
Innovation

Methods, ideas, or system contributions that make the work stand out.

speech editing detection
content localization
audio large language model
acoustic consistency
forgery bias mitigation
Authors

Jun Xue
Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education; School of Cyber Science and Engineering, Wuhan University, Wuhan, China

Yi Chai
Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education; School of Cyber Science and Engineering, Wuhan University, Wuhan, China

Yanzhen Ren
Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education; School of Cyber Science and Engineering, Wuhan University, Wuhan, China

Jinshen He
Independent Researcher

Zhiqiang Tang
AWS AI
AutoML, EfficientML, RobustML

Zhuolin Yi
Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education; School of Cyber Science and Engineering, Wuhan University, Wuhan, China

Yihuan Huang
Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education; School of Cyber Science and Engineering, Wuhan University, Wuhan, China

Yuankun Xie
PhD Candidate, Communication University of China
Audio Deepfake Detection, Domain Generalization, Out-of-Distribution Detection, Neural Audio Codec

Yujie Chen
Beihang University
Knowledge Graph Completion, Knowledge Graph