LENS-DF: Deepfake Detection and Temporal Localization for Long-Form Noisy Speech

📅 2025-07-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of deepfake detection and temporal localization for long-duration, multi-speaker, noisy speech in complex real-world scenarios. We propose the first end-to-end trainable and evaluable framework explicitly designed for robustness assessment. Our method introduces (1) a controllable synthesis pipeline that generates realistic fake speech exhibiting simultaneous characteristics of extended duration, environmental noise, and multiple speakers—establishing a more deployment-relevant benchmark; and (2) a self-supervised frontend coupled with a lightweight backend, integrated with targeted data augmentation strategies to enhance robustness against acoustic degradation and speaker variability. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art methods in both detection accuracy and temporal localization precision, particularly under high-noise conditions and cross-speaker generalization settings.

Technology Category

Application Category

📝 Abstract
This study introduces LENS-DF, a novel and comprehensive recipe for training and evaluating audio deepfake detection and temporal localization under complicated and realistic audio conditions. The generation part of the recipe outputs audios from the input dataset with several critical characteristics, such as longer duration, noisy conditions, and containing multiple speakers, in a controllable fashion. The corresponding detection and localization protocol uses models. We conduct experiments based on self-supervised learning front-end and simple back-end. The results indicate that models trained using data generated with LENS-DF consistently outperform those trained via conventional recipes, demonstrating the effectiveness and usefulness of LENS-DF for robust audio deepfake detection and localization. We also conduct ablation studies on the variations introduced, investigating their impact on and relevance to realistic challenges in the field.
Problem

Research questions and friction points this paper is trying to address.

Detects deepfakes in long noisy audio
Localizes deepfake segments temporally
Handles multi-speaker complex audio conditions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Long-duration, noisy, multi-speaker audio generation
Self-supervised learning front-end for detection
Simple back-end model for localization
🔎 Similar Papers