🤖 AI Summary
This paper addresses two key challenges in environmental sound deepfake detection: generalization to unknown generators (Track 1) and black-box, low-resource deployment (Track 2). To this end, we propose a lightweight and robust detection framework. Our method couples a frozen self-supervised audio encoder (SSLAM) with a compact feed-forward network, keeping the trainable part of the system small. A core contribution is the fusion of intermediate SSLAM features from layers 4-9, which is intended to improve the consistency of representations across generators. Additionally, we adopt a class-weighted loss to mitigate severe data imbalance. Evaluated on the ESDD 2026 benchmark across both tracks, our approach achieves equal error rates of 1.20% (Track 1) and 1.05% (Track 2), outperforming the official baseline on both. These results indicate strong generalization to unseen generative models and practical efficacy under stringent resource constraints.
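Since both tracks are reported in terms of equal error rate, a minimal sketch of that metric may help in reading the numbers. The function name `equal_error_rate`, the score convention (higher means more likely fake), and the label encoding (1 = fake, 0 = real) are assumptions made for illustration; this is not the challenge's official scoring code.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping every observed score as a threshold.

    Assumed convention: higher score = more likely fake; label 1 = fake, 0 = real.
    """
    fake = [s for s, y in zip(scores, labels) if y == 1]
    real = [s for s, y in zip(scores, labels) if y == 0]
    if not fake or not real:
        raise ValueError("need at least one real and one fake sample")

    best_gap, eer = None, None
    for t in sorted(set(scores)):
        # Real clips scored at or above the threshold are false alarms.
        far = sum(s >= t for s in real) / len(real)
        # Fake clips scored below the threshold are misses.
        frr = sum(s < t for s in fake) / len(fake)
        gap = abs(far - frr)
        # Keep the operating point where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

The sweep is quadratic in the number of samples, which is fine for a small check but would normally be replaced by a sorted, single-pass computation on a full test set.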
📝 Abstract
Recent advances in generative audio models have enabled high-fidelity environmental sound synthesis, raising serious concerns for audio security. The ESDD 2026 Challenge therefore addresses environmental sound deepfake detection in two settings: unseen generators (Track 1) and black-box, low-resource detection (Track 2). We propose EnvSSLAM-FFN, which integrates a frozen SSLAM self-supervised encoder with a lightweight feed-forward network (FFN) back-end. To capture spoofing artifacts under severe data imbalance, we fuse intermediate SSLAM representations from layers 4-9 and adopt a class-weighted training objective. Experimental results show that the proposed system outperforms the official baselines on both tracks, achieving test equal error rates (EERs) of 1.20% on Track 1 and 1.05% on Track 2.
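The abstract does not specify how the layer 4-9 representations are fused or how the class weights are set, so the following is only an illustrative sketch under stated assumptions: fusion is taken to be a plain average over the selected layers (with layers counted from 1), and class weights are taken to be inverse class frequencies. The helper names `fuse_layers`, `inverse_frequency_weights`, and `weighted_cross_entropy` are hypothetical and do not come from the paper.

```python
import math


def fuse_layers(hidden_states, first=4, last=9):
    """Average per-layer embeddings for layers first..last (1-indexed, inclusive).

    Assumption: fusion is a simple mean; the paper may use a different scheme.
    """
    selected = hidden_states[first - 1:last]
    dim = len(selected[0])
    return [sum(layer[i] for layer in selected) / len(selected) for i in range(dim)]


def inverse_frequency_weights(labels):
    """Weight each class by n / (num_classes * class_count), so rare classes count more."""
    n = len(labels)
    classes = sorted(set(labels))
    return {c: n / (len(classes) * labels.count(c)) for c in classes}


def weighted_cross_entropy(probs, labels, weights):
    """Class-weighted binary cross-entropy; probs are predicted P(label == 1)."""
    eps = 1e-12  # guards against log(0)
    total, norm = 0.0, 0.0
    for p, y in zip(probs, labels):
        p_true = p if y == 1 else 1.0 - p
        total += -weights[y] * math.log(max(p_true, eps))
        norm += weights[y]
    # Normalize by the summed weights so the loss scale does not depend on the class mix.
    return total / norm
```

In this sketch the encoder stays outside the training loop: only the fused vector is passed to the back-end, which matches the frozen-encoder setup described above, while the weighting keeps the minority class from being swamped in the loss.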