EnvSSLAM-FFN: Lightweight Layer-Fused System for ESDD 2026 Challenge

📅 2025-12-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper addresses two challenges in environmental sound deepfake detection: generalization to unknown generators (Track 1) and black-box detection under low-resource conditions (Track 2). The proposed framework couples a frozen SSLAM self-supervised audio encoder with a compact feed-forward network (FFN), so only the lightweight back-end is trained. Its core contribution is the fusion of intermediate SSLAM features from layers 4 to 9, which is intended to make representations more consistent across generators. A class-weighted loss is used to mitigate severe data imbalance. On the ESDD 2026 benchmark, the approach achieves test equal error rates (EERs) of 1.20% on Track 1 and 1.05% on Track 2, outperforming the official baselines on both tracks. These results indicate good generalization to unseen generative models and practical effectiveness under tight resource constraints.

📝 Abstract

Recent advances in generative audio models have enabled high-fidelity environmental sound synthesis, raising serious concerns for audio security. The ESDD 2026 Challenge therefore addresses environmental sound deepfake detection under two conditions: unseen generators (Track 1) and black-box, low-resource detection (Track 2). We propose EnvSSLAM-FFN, which integrates a frozen SSLAM self-supervised encoder with a lightweight FFN back-end. To capture spoofing artifacts effectively under severe data imbalance, we fuse intermediate SSLAM representations from layers 4-9 and adopt a class-weighted training objective. Experimental results show that the proposed system consistently outperforms the official baselines on both tracks, achieving test Equal Error Rates (EERs) of 1.20% on Track 1 and 1.05% on Track 2.
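For readers unfamiliar with the metric, the EER is the operating point at which the false-acceptance rate and the false-rejection rate are equal. The following is a minimal sketch of one common way to estimate it from detector scores, not the challenge's official scoring script. The function name `compute_eer`, the label convention (1 = fake, 0 = real), and the assumption that a higher score means "more likely fake" are all illustrative choices, not details taken from the paper.

```python
def compute_eer(scores, labels):
    """Estimate the equal error rate by sweeping a decision threshold.

    Assumptions (illustrative, not from the paper):
      - labels: 1 = fake (positive class), 0 = real (negative class)
      - a sample is predicted fake when its score >= threshold
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present to compute an EER")

    best_gap, eer = float("inf"), 1.0
    # Candidate thresholds: every observed score, plus one above all scores.
    for threshold in sorted(set(scores)) + [float("inf")]:
        false_accepts = sum(
            1 for s, y in zip(scores, labels) if y == 0 and s >= threshold
        )
        false_rejects = sum(
            1 for s, y in zip(scores, labels) if y == 1 and s < threshold
        )
        far = false_accepts / n_neg  # real clips flagged as fake
        frr = false_rejects / n_pos  # fake clips passed as real
        gap = abs(far - frr)
        if gap < best_gap:
            # Report the midpoint of the two rates where they are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

With finite test sets the two error rates rarely cross exactly, so this sketch reports the midpoint at the closest threshold; official tools may interpolate differently and give slightly different values.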

Problem

Research questions and friction points this paper is trying to address.

Detect environmental sound deepfakes from unseen generators

Perform black-box detection with limited training resources

Capture spoofing artifacts despite severe data imbalance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Frozen SSLAM encoder with lightweight FFN back-end

Fused intermediate SSLAM representations from layers 4-9

Class-weighted training objective for data imbalance
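The layer fusion and class-weighted objective listed above can be illustrated with a small, dependency-free sketch. The summary does not state how the layers are fused or how the class weights are chosen, so everything here is an assumption: fusion is shown as a plain average of per-layer embeddings, the weights are inverse class frequencies, and the helper names (`fuse_layers`, `inverse_frequency_weights`, `weighted_cross_entropy`) are hypothetical. The frozen SSLAM encoder and the FFN itself are left out; the sketch assumes per-layer clip embeddings and FFN logits are already available.

```python
import math


def fuse_layers(hidden_states, first=4, last=9):
    """Average clip-level embeddings from layers first..last (inclusive).

    Assumption: hidden_states[i] is the embedding (list of floats) produced
    by encoder layer i. Averaging is one possible fusion; the paper may use
    a different operator (e.g. concatenation or learned weights).
    """
    selected = hidden_states[first:last + 1]
    dim = len(selected[0])
    return [sum(layer[d] for layer in selected) / len(selected) for d in range(dim)]


def inverse_frequency_weights(labels, num_classes=2):
    """Weight each class by total / (num_classes * class_count)."""
    counts = [labels.count(c) for c in range(num_classes)]
    if 0 in counts:
        raise ValueError("every class needs at least one training example")
    return [len(labels) / (num_classes * n) for n in counts]


def weighted_cross_entropy(logits, labels, class_weights):
    """Weighted mean of per-sample cross-entropy losses."""
    weighted_sum, weight_total = 0.0, 0.0
    for row, y in zip(logits, labels):
        # Numerically stable log-sum-exp over the class logits.
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        loss = log_sum_exp - row[y]  # equals -log softmax(row)[y]
        weighted_sum += class_weights[y] * loss
        weight_total += class_weights[y]
    return weighted_sum / weight_total
```

In this sketch a minority class with few training clips receives a proportionally larger weight, so its errors contribute more to the loss. Because the encoder is frozen, only the back-end that consumes the fused embedding would be updated by this objective.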
