🤖 AI Summary
Existing self-supervised audio representation models are typically designed for a single domain (e.g., speech or music), which limits their cross-domain generalization. To address this, the authors propose USAD (Universal Speech and Audio Distillation), which transfers knowledge from domain-specific self-supervised teacher models covering speech, sound, and music into a single student encoder via layer-to-layer knowledge distillation. Trained on a large-scale mixed-audio corpus, USAD learns robust, general-purpose features that serve all three domains with one encoder. Evaluated on the SUPERB and HEAR benchmarks, it achieves near-state-of-the-art performance across diverse downstream tasks, including frame-level and instance-level speech processing (e.g., ASR), audio tagging, and sound classification.
📝 Abstract
Self-supervised learning (SSL) has revolutionized audio representations, yet models often remain domain-specific, focusing on either speech or non-speech tasks. In this work, we present Universal Speech and Audio Distillation (USAD), a unified approach to audio representation learning that integrates diverse audio types (speech, sound, and music) into a single model. USAD employs efficient layer-to-layer distillation from domain-specific SSL models to train a student on a comprehensive audio dataset. With a single encoder, USAD delivers competitive performance across benchmarks and datasets, including frame- and instance-level speech processing tasks, audio tagging, and sound classification, achieving near-state-of-the-art results on SUPERB and HEAR.
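The core mechanism is layer-to-layer distillation: the student's intermediate representations are regressed onto the hidden states of each frozen domain-specific teacher through small prediction heads. The sketch below illustrates one plausible shape of such an objective in PyTorch, assuming an L1-plus-cosine per-layer loss (a common choice in SSL distillation, e.g. DistilHuBERT-style); all class and argument names are hypothetical, and the paper's exact layer mapping and loss weighting are not specified here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerToLayerDistiller(nn.Module):
    """Minimal sketch of multi-teacher layer-to-layer distillation.

    A single student encoder is trained to predict the hidden states of
    several frozen domain-specific teachers (e.g. speech, sound, music).
    The L1 + cosine objective and all names are illustrative assumptions,
    not the paper's exact recipe.
    """

    def __init__(self, student: nn.Module, student_dim: int,
                 teachers: dict[str, nn.Module], teacher_dims: dict[str, int]):
        super().__init__()
        self.student = student
        self.teachers = nn.ModuleDict(teachers)
        for t in self.teachers.values():      # teachers stay frozen
            for p in t.parameters():
                p.requires_grad_(False)
        # one linear head per teacher, mapping student features to teacher space
        self.heads = nn.ModuleDict({
            name: nn.Linear(student_dim, dim) for name, dim in teacher_dims.items()
        })

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # assume every encoder returns a list of per-layer hidden states,
        # each of shape (batch, frames, dim)
        student_layers = self.student(wav)
        loss = 0.0
        for name, teacher in self.teachers.items():
            with torch.no_grad():
                teacher_layers = teacher(wav)
            # assume matched layer counts for simplicity; in practice a
            # student-to-teacher layer mapping would be needed
            for s, t in zip(student_layers, teacher_layers):
                pred = self.heads[name](s)
                # L1 distance plus (1 - cosine similarity), averaged over frames
                loss = loss + F.l1_loss(pred, t) \
                            + (1 - F.cosine_similarity(pred, t, dim=-1)).mean()
        return loss
```

Freezing the teachers keeps pretraining cost close to one extra forward pass per teacher, and in this pattern the lightweight per-teacher heads can be discarded after training, leaving only the single student encoder for downstream use.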