Localizing Speech Deepfakes Beyond Transitions via Segment-Aware Learning

📅 2026-01-29

🤖 AI Summary

This work addresses a limitation of existing methods for localizing partial deepfake audio: they rely heavily on boundary artifacts at the transitions between genuine and forged speech, and so tend to miss the manipulated content inside a segment. To reduce this dependence on transitions, the paper proposes Segment-Aware Learning (SAL), a framework that models the internal structure of speech segments. SAL combines Segment Positional Labeling, which supervises each frame according to its relative position within its segment, with Cross-Segment Mixing, a data augmentation method that generates diverse segment patterns. Across multiple deepfake localization datasets, SAL consistently achieves strong performance in both in-domain and out-of-domain settings, with notable gains in non-boundary regions.
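As a rough illustration of position-based frame supervision, the sketch below turns a binary frame-label sequence (0 = genuine, 1 = forged) into a relative position in [0, 1] for each frame within its contiguous segment. The function name `segment_positions` and the linear 0-to-1 encoding are assumptions made for this example; the summary does not specify the exact label scheme used by SAL.

```python
from itertools import groupby


def segment_positions(frame_labels):
    """Return each frame's relative position within its contiguous segment.

    Hypothetical encoding (an assumption, not taken from the paper):
    the first frame of a segment maps to 0.0, the last to 1.0, and
    frames in between are spaced linearly. A one-frame segment maps to 0.0.
    """
    positions = []
    # groupby yields one run per maximal stretch of identical labels,
    # i.e. one run per genuine or forged segment.
    for _, run in groupby(frame_labels):
        length = len(list(run))
        denom = max(length - 1, 1)  # avoid division by zero for 1-frame runs
        positions.extend(i / denom for i in range(length))
    return positions


# Three genuine frames, two forged frames, one genuine frame.
print(segment_positions([0, 0, 0, 1, 1, 0]))
```

Pairing these positions with the original binary labels gives every frame a target that depends on where it sits inside its segment, so frames far from a transition still carry a distinct supervision signal.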

📝 Abstract

Localizing partial deepfake audio, where only segments of speech are manipulated, remains challenging due to the subtle and scattered nature of these modifications. Existing approaches typically rely on frame-level predictions to identify spoofed segments, and some recent methods improve performance by concentrating on the transitions between real and fake audio. However, we observe that these models tend to over-rely on boundary artifacts while neglecting the manipulated content that follows. We argue that effective localization requires understanding entire segments, not just detecting transitions. Thus, we propose Segment-Aware Learning (SAL), a framework that encourages models to focus on the internal structure of segments. SAL introduces two core techniques: Segment Positional Labeling, which provides fine-grained frame supervision based on relative position within a segment; and Cross-Segment Mixing, a data augmentation method that generates diverse segment patterns. Experiments across multiple deepfake localization datasets show that SAL consistently achieves strong performance in both in-domain and out-of-domain settings, with notable gains in non-boundary regions and reduced reliance on transition artifacts. The code is available at https://github.com/SentryMao/SAL.
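One way to picture a segment-mixing augmentation is sketched below. It is a guess at the general idea and not the procedure from the paper or its repository: each utterance is cut into contiguous same-label segments, and segments of one utterance are randomly replaced with segments drawn from another, producing new real/fake patterns. The names `split_segments`, `cross_segment_mix`, and `swap_prob` are hypothetical, and frames are represented as plain list items standing in for audio features.

```python
import random
from itertools import groupby


def split_segments(frames, labels):
    """Cut parallel frame/label lists into chunks with a constant label."""
    segments, start = [], 0
    for _, run in groupby(labels):
        length = len(list(run))
        end = start + length
        segments.append((frames[start:end], labels[start:end]))
        start = end
    return segments


def cross_segment_mix(frames_a, labels_a, frames_b, labels_b,
                      swap_prob=0.5, rng=None):
    """Hypothetical mixing rule (an assumption, not the paper's method):
    walk through the segments of utterance A and, with probability
    swap_prob, substitute a randomly chosen segment from utterance B.
    Frame-level labels travel with their frames, so they stay aligned.
    """
    rng = rng or random.Random()
    segs_a = split_segments(frames_a, labels_a)
    segs_b = split_segments(frames_b, labels_b)
    mixed_frames, mixed_labels = [], []
    for seg in segs_a:
        if segs_b and rng.random() < swap_prob:
            seg = rng.choice(segs_b)
        mixed_frames.extend(seg[0])
        mixed_labels.extend(seg[1])
    return mixed_frames, mixed_labels


frames, labels = cross_segment_mix(
    ["a0", "a1", "a2", "a3"], [0, 0, 1, 1],
    ["b0", "b1", "b2"], [1, 0, 0],
    rng=random.Random(0),
)
print(frames, labels)
```

Because whole segments are moved rather than individual frames, the splice points in the mixed utterance no longer coincide with the original recording's transitions, which is one plausible way such augmentation could discourage a model from keying only on boundary artifacts.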

Problem

Research questions and friction points this paper is trying to address.

speech deepfake

partial deepfake audio

deepfake localization

segment manipulation

audio spoofing

Innovation

Methods, ideas, or system contributions that make the work stand out.

Segment-Aware Learning

deepfake audio localization

Segment Positional Labeling