S-SONDO: Self-Supervised Knowledge Distillation for General Audio Foundation Models

📅 2026-04-27

🤖 AI Summary
This work addresses the challenge of deploying large general-purpose audio foundation models on resource-constrained edge devices. Conventional knowledge distillation relies on class logits or intermediate feature alignment, so it does not apply to teachers that output only embeddings, such as self-supervised or metric-learning models. The authors propose a self-supervised distillation framework that uses only the teacher's output embeddings, which removes the need for logits or layer-wise alignment and keeps the method compatible with different teacher architectures. Combining embedding-space alignment, clustering-based balanced data sampling, and lightweight student models, the method distills two audio foundation models into three students that are up to 61 times smaller while retaining up to 96% of teacher performance.
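To make the idea of embedding-only distillation concrete, here is a minimal sketch of one possible objective: the student embedding is mapped into the teacher's embedding space by a linear projection, and the loss is the mean cosine distance to the teacher embedding. The cosine loss, the projection matrix `W`, and all function names are assumptions chosen for illustration; the summary does not say which loss or projection S-SONDO actually uses.

```python
import math

def cosine_distance(u, v, eps=1e-12):
    """Return 1 - cos(u, v) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # eps guards against division by zero for all-zero vectors
    return 1.0 - dot / max(norm_u * norm_v, eps)

def project(W, x):
    """Apply a linear map W (one row per teacher dimension) to a student embedding x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def distillation_loss(student_batch, teacher_batch, W):
    """Mean cosine distance between projected student and teacher embeddings.

    Only the teacher's output embeddings are needed: no logits, no
    intermediate layers. (Hypothetical objective, not the paper's code.)
    """
    losses = [
        cosine_distance(project(W, s), t)
        for s, t in zip(student_batch, teacher_batch)
    ]
    return sum(losses) / len(losses)

# Toy example: 2-dimensional student, 3-dimensional teacher.
W = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
students = [[1.0, 0.0], [0.0, 1.0]]
teachers = [[2.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
loss = distillation_loss(students, teachers, W)
```

In the toy example the first pair is perfectly aligned (distance 0) and the second is orthogonal (distance 1), so the batch loss is 0.5. In practice `W` would be learned jointly with the student.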
📝 Abstract
General audio foundation models have recently achieved remarkable progress, enabling strong performance across diverse tasks. However, state-of-the-art models remain extremely large, often with hundreds of millions of parameters, leading to high inference costs and limited deployability on edge devices. Knowledge distillation is a proven strategy for model compression, but prior work in audio has mostly focused on supervised settings, relying on class logits, intermediate features, or architecture-specific techniques. Such assumptions exclude models that output only embeddings, such as self-supervised or metric-learning models. We introduce S-SONDO (Self-Supervised KnOwledge DistillatioN for General AuDio FOundation Models), the first framework to distill general audio models using only their output embeddings. By avoiding the need for logits or layer-level alignment, S-SONDO is architecture-agnostic and broadly applicable to embedding-based teachers. We demonstrate its effectiveness by distilling two audio foundation models into three efficient students that are up to 61 times smaller while retaining up to 96% of teacher performance. We also provide practical insights on loss choice and clustering-based balanced data sampling. Code is available here: https://github.com/MedAliAdlouni/ssondo.
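The clustering-based balanced data sampling mentioned in the abstract can be illustrated with a short sketch. It assumes each training clip already has a cluster label (for example from k-means on teacher embeddings, which is an assumption here) and draws samples so that every cluster is equally likely, regardless of its size. The function name and sampling scheme are hypothetical and may differ from the authors' procedure.

```python
import random
from collections import defaultdict

def balanced_sample(cluster_ids, n_samples, seed=0):
    """Draw dataset indices so that each cluster is picked with equal probability.

    cluster_ids[i] is the (assumed precomputed) cluster label of item i.
    Sampling is with replacement: first a cluster uniformly at random,
    then an item uniformly within that cluster.
    """
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        by_cluster[cid].append(idx)
    clusters = sorted(by_cluster)
    picks = []
    for _ in range(n_samples):
        cid = rng.choice(clusters)
        picks.append(rng.choice(by_cluster[cid]))
    return picks

# Toy example: a heavily imbalanced dataset (98 items vs. 2 items).
cluster_ids = [0] * 98 + [1] * 2
picks = balanced_sample(cluster_ids, 1000)
minority_share = sum(cluster_ids[i] == 1 for i in picks) / len(picks)
```

With uniform sampling over items the minority cluster would account for about 2% of draws; under this scheme its share is close to 50%.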
Problem

Research questions and friction points this paper is trying to address.

audio foundation models
model compression
knowledge distillation
self-supervised learning
embedding-based models
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-supervised knowledge distillation
audio foundation models
embedding-based distillation
model compression
architecture-agnostic