SONAR: Self-Distilled Continual Pre-training for Domain Adaptive Audio Representation

📅 2025-09-19
🤖 AI Summary
Existing audio representation models face two coupled challenges, high retraining cost and catastrophic forgetting, when adapting to continuously arriving unlabeled audio data. This paper proposes a continual pre-training framework for audio representation built on the BEATs architecture, introducing a tripartite synergy mechanism: (1) joint sampling of new and old data; (2) contrastive regularization balanced via self-distillation; and (3) a dynamically expandable tokenizer codebook. The framework jointly optimizes domain adaptation and general representation capability without compromising performance on the original tasks. Experiments across four heterogeneous audio domains show significant gains in new-domain performance (+8.2% average accuracy) while keeping prior-task accuracy degradation under 0.5%, confirming strong resistance to catastrophic forgetting and good cross-domain generalization.
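The first ingredient of the mechanism above, joint sampling of new and old data, can be sketched as a replay-style batch builder. This is a minimal illustration, not the paper's implementation: the `replay_ratio` knob and fixed-ratio mixing policy are assumptions; SONAR's actual sampling schedule may differ.

```python
import random

def joint_sample(new_data, old_data, batch_size, replay_ratio=0.25, rng=None):
    """Draw one mixed batch: mostly new-domain clips, plus a replay slice
    of previously seen data to resist forgetting.

    replay_ratio is a hypothetical knob (fraction of the batch taken
    from old data), not a value from the paper.
    """
    rng = rng or random.Random(0)  # deterministic default for the sketch
    n_old = int(batch_size * replay_ratio)
    n_new = batch_size - n_old
    batch = rng.sample(new_data, n_new) + rng.sample(old_data, n_old)
    rng.shuffle(batch)  # interleave so gradients see both domains per step
    return batch
```

In practice such a sampler would feed audio clips (or their features) into the SSL objective each step; keeping a nonzero old-data fraction is the standard replay argument for why prior-task accuracy degrades so little.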

📝 Abstract
Self-supervised learning (SSL) on large-scale datasets like AudioSet has become the dominant paradigm for audio representation learning. While the continuous influx of new, unlabeled audio presents an opportunity to enrich these static representations, a naive approach is to retrain the model from scratch using all available data. However, this method is computationally prohibitive and discards the valuable knowledge embedded in the previously trained model weights. To address this inefficiency, we propose SONAR (Self-distilled cONtinual pre-training for domain adaptive Audio Representation), a continual pre-training framework built upon BEATs. SONAR effectively adapts to new domains while mitigating catastrophic forgetting by tackling three key challenges: implementing a joint sampling strategy for new and prior data, applying regularization to balance specificity and generality, and dynamically expanding the tokenizer codebook for novel acoustic patterns. Experiments across four distinct domains demonstrate that our method achieves both high adaptability and robust resistance to forgetting.
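The regularization the abstract describes, balancing specificity against generality, amounts to penalizing drift of the updated model's representations away from the frozen pre-trained teacher. A minimal sketch, assuming a squared-error penalty and a weighting coefficient `lam` (both illustrative; the paper pairs this with a contrastive term):

```python
def self_distill_loss(student_repr, teacher_repr, task_loss, lam=0.5):
    """Combine the new-domain SSL loss with a self-distillation penalty
    tying the student's representations to the frozen teacher's
    (e.g. the original BEATs checkpoint).

    lam and the mean-squared-error form are illustrative assumptions.
    """
    # Mean squared distance between student and teacher embeddings.
    reg = sum((s - t) ** 2 for s, t in zip(student_repr, teacher_repr)) / len(student_repr)
    return task_loss + lam * reg
```

A larger `lam` pulls harder toward the original representation (generality, less forgetting); a smaller one lets the model specialize to the new domain.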
Problem

Research questions and friction points this paper is trying to address.

Adapting audio representations to new domains continually
Mitigating catastrophic forgetting in continual pre-training
Efficiently integrating new acoustic patterns without retraining
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-distilled continual pre-training framework
Joint sampling strategy for data
Dynamic tokenizer-codebook expansion for novel acoustic patterns
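The codebook-expansion idea in the list above can be illustrated with a greedy rule: append a new codeword whenever an incoming feature is far from every existing one. The distance threshold and the greedy one-pass policy are assumptions for illustration; the paper's actual expansion criterion may differ.

```python
import math

def expand_codebook(codebook, features, threshold=1.0):
    """Grow a tokenizer codebook in place: a feature becomes a new
    codeword if its nearest existing codeword is farther than
    `threshold` (Euclidean distance).

    threshold and the greedy policy are illustrative assumptions.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    for f in features:
        # Only add a codeword when no existing one covers this pattern.
        if all(dist(f, c) > threshold for c in codebook):
            codebook.append(list(f))
    return codebook
```

This captures the intended behavior: familiar acoustic patterns map to existing tokens, while genuinely novel ones get fresh codewords instead of being forced onto the old vocabulary.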
Authors
Yizhou Zhang (Graduate School of Informatics, Kyoto University, Japan)
Yuan Gao (Graduate School of Informatics, Kyoto University, Japan)
Wangjin Zhou (Kyoto University; Deep Learning)
Zicheng Yuan (Graduate School of Informatics, Kyoto University, Japan)
Keisuke Imoto (Kyoto University; Acoustic Signal Processing, Sound Event Detection)
Tatsuya Kawahara (Professor, School of Informatics, Kyoto University; Speech Processing, Speech Recognition, Natural Language Processing, Dialogue)