EnvSSLAM-FFN: Lightweight Layer-Fused System for ESDD 2026 Challenge

📅 2025-12-23
🤖 AI Summary
This paper addresses two challenges in environmental sound deepfake detection: generalization to unseen generators (Track 1) and black-box, low-resource detection (Track 2). The proposed system, EnvSSLAM-FFN, pairs a frozen SSLAM self-supervised audio encoder with a compact feed-forward network (FFN) back-end, so the encoder itself is not updated during training. Its main contribution is the fusion of intermediate SSLAM representations from layers 4-9 to capture spoofing artifacts, and a class-weighted loss counters the severe data imbalance. On the ESDD 2026 test sets, the system achieves equal error rates (EERs) of 1.20% (Track 1) and 1.05% (Track 2), outperforming the official baselines on both tracks. These results suggest good generalization to unseen generative models and practical value under tight resource constraints.
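
The summary names a class-weighted loss but does not give the weighting scheme. As a minimal sketch, assuming inverse-frequency weights and a binary cross-entropy objective (both are illustrative assumptions, as are the function names and the 9:1 toy label split), the idea looks roughly like this:

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by N / (K * count), so rarer classes weigh more.

    Assumption: the paper does not state its weighting scheme; inverse
    frequency is one common choice, used here only for illustration.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


def weighted_bce(probs, labels, weights):
    """Weighted mean of binary cross-entropy; probs are P(label == 1)."""
    eps = 1e-12  # keeps log() away from zero
    total, norm = 0.0, 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        loss = -math.log(p) if y == 1 else -math.log(1.0 - p)
        total += weights[y] * loss
        norm += weights[y]
    return total / norm


# Hypothetical 9:1 imbalance (1 = fake, 0 = real); not the challenge's ratio.
labels = [1] * 9 + [0] * 1
weights = inverse_frequency_weights(labels)
loss = weighted_bce([0.5] * 10, labels, weights)
```

With these weights, the single minority example contributes as much to the loss as the nine majority examples combined, which is the effect a class-weighted objective is meant to have.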

📝 Abstract
Recent advances in generative audio models have enabled high-fidelity environmental sound synthesis, raising serious concerns for audio security. The ESDD 2026 Challenge therefore addresses environmental sound deepfake detection under two conditions: unseen generators (Track 1) and black-box, low-resource detection (Track 2). We propose EnvSSLAM-FFN, which integrates a frozen SSLAM self-supervised encoder with a lightweight FFN back-end. To capture spoofing artifacts effectively under severe data imbalance, we fuse intermediate SSLAM representations from layers 4-9 and adopt a class-weighted training objective. Experimental results show that the proposed system consistently outperforms the official baselines on both tracks, achieving test Equal Error Rates (EERs) of 1.20% on Track 1 and 1.05% on Track 2.
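
For readers unfamiliar with the metric, the EER is the operating point at which the false-positive rate equals the false-negative rate. The sketch below is a simple threshold sweep, not the challenge's official scoring script; it assumes that higher scores mean "more likely fake" and that label 1 marks fake clips. The function name is hypothetical.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER with a sweep over candidate thresholds.

    Assumptions: higher score = more likely fake, label 1 = fake, 0 = real.
    Quadratic in the number of clips, which is fine for a sketch.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")

    best_gap, eer = float("inf"), 1.0
    # A clip is flagged as fake when its score is >= the threshold.
    for thr in sorted(set(scores)) + [float("inf")]:
        fnr = sum(s < thr for s in pos) / len(pos)   # missed fakes
        fpr = sum(s >= thr for s in neg) / len(neg)  # real flagged as fake
        gap = abs(fnr - fpr)
        if gap < best_gap:
            best_gap, eer = gap, (fnr + fpr) / 2
    return eer


# Toy scores that separate the two classes perfectly.
separable = equal_error_rate([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
# Toy scores with one real clip ranked above one fake clip.
overlapping = equal_error_rate([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

An EER of 1.20% therefore means that, at the balanced threshold, about 1.2% of real clips are flagged as fake and about 1.2% of fake clips are missed.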
Problem

Research questions and friction points this paper is trying to address.

Detect environmental sound deepfakes from unseen generators
Perform black-box detection with limited training resources
Capture spoofing artifacts despite severe data imbalance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Frozen SSLAM encoder with lightweight FFN back-end
Fused intermediate SSLAM representations from layers 4-9
Class-weighted training objective for data imbalance
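
The page does not say how the layer 4-9 representations are fused or how the FFN is sized. The sketch below shows one plausible arrangement under stated assumptions: a softmax-weighted sum over the selected layers, mean pooling over time, and an FFN with a single hidden ReLU layer. All function names, shapes, and the convention that hidden_states[i] holds the output of layer i are hypothetical; whether the paper counts layers from 0 or 1 is also unknown.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_layers(hidden_states, layer_logits, first=4, last=9):
    """Softmax-weighted sum of layers first..last (inclusive).

    hidden_states[i] is assumed to be the output of encoder layer i, given
    as a list of frames, each frame a list of floats. layer_logits holds
    one scalar per fused layer (learnable in a real system).
    """
    selected = hidden_states[first:last + 1]
    if len(selected) != len(layer_logits):
        raise ValueError("need one logit per fused layer")
    weights = softmax(layer_logits)
    n_frames, dim = len(selected[0]), len(selected[0][0])
    return [
        [sum(w * layer[t][d] for w, layer in zip(weights, selected))
         for d in range(dim)]
        for t in range(n_frames)
    ]


def mean_pool(frames):
    """Average the frame vectors over time into one clip-level vector."""
    return [sum(col) / len(frames) for col in zip(*frames)]


def ffn_logit(x, w1, b1, w2, b2):
    """One hidden ReLU layer followed by a scalar output logit."""
    hidden = [max(0.0, sum(wi * xi for wi, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


# Toy input: 13 layer outputs, 2 frames, 3 dims; layer i is filled with i.
hidden_states = [[[float(i)] * 3 for _ in range(2)] for i in range(13)]
fused = fuse_layers(hidden_states, layer_logits=[0.0] * 6)  # uniform weights
pooled = mean_pool(fused)
logit = ffn_logit(pooled, w1=[[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]],
                  b1=[0.0, 0.0], w2=[1.0, 1.0], b2=0.5)
```

In this arrangement the frozen encoder only supplies hidden_states; the trainable parameters are the six layer logits and the FFN weights, which is what keeps the back-end lightweight.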
Xiaoxuan Guo
State Key Laboratory of Media Convergence and Communication, Communication University of China, Beijing, China
Hengyan Huang
Pursuing a degree in Intelligent Science and Technology at the Communication University of China.
AIGC, MLLM, Audio-Visual Processing
Jiayi Zhou
Machine Intelligence, Ant Group, Shanghai, China
Renhe Sun
Machine Intelligence, Ant Group, Shanghai, China
Jian Liu
Machine Intelligence, Ant Group, Shanghai, China
Haonan Cheng
State Key Laboratory of Media Convergence and Communication, Communication University of China, Beijing, China
Long Ye
Communication University of China
Multimedia Signal Processing, Artificial Intelligence
Qin Zhang
Key Laboratory of Media Audio & Video, Ministry of Education, Communication University of China, Beijing, China