🤖 AI Summary
Existing self-supervised audio representation models are typically designed for a single domain (e.g., speech or music), which limits their cross-domain generalization. To address this, the authors propose USAD (Universal Speech and Audio Distillation), which transfers knowledge from domain-specific self-supervised teacher models covering speech, sound, and music into a single student encoder via layer-to-layer knowledge distillation. Trained on a large-scale mixed-audio corpus, USAD learns robust, general-purpose features that serve all three domains with one encoder. Evaluated on the SUPERB and HEAR benchmarks, it achieves near-state-of-the-art performance across diverse downstream tasks, including frame-level and instance-level speech processing (e.g., ASR), audio tagging, and sound classification.
📝 Abstract
Self-supervised learning (SSL) has revolutionized audio representations, yet models often remain domain-specific, focusing on either speech or non-speech tasks. In this work, we present Universal Speech and Audio Distillation (USAD), a unified approach to audio representation learning that integrates diverse audio types (speech, sound, and music) into a single model. USAD employs efficient layer-to-layer distillation from domain-specific SSL models to train a student on a comprehensive audio dataset. With a single encoder, USAD delivers competitive performance across benchmarks and datasets, including frame- and instance-level speech processing tasks, audio tagging, and sound classification, achieving near-state-of-the-art results on SUPERB and HEAR.
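The core mechanism is layer-to-layer distillation: the student's intermediate representations are regressed onto the hidden states of each frozen domain-specific teacher through small prediction heads. The sketch below illustrates one plausible shape of such an objective in PyTorch, assuming an L1-plus-cosine per-layer loss (a common choice in SSL distillation, e.g. DistilHuBERT-style); all class and argument names are hypothetical, and the paper's exact layer mapping and loss weighting are not specified here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerToLayerDistiller(nn.Module):
    """Minimal sketch of multi-teacher layer-to-layer distillation.

    A single student encoder is trained to predict the hidden states of
    several frozen domain-specific teachers (e.g. speech, sound, music).
    The L1 + cosine objective and all names are illustrative assumptions,
    not the paper's exact recipe.
    """

    def __init__(self, student: nn.Module, student_dim: int,
                 teachers: dict[str, nn.Module], teacher_dims: dict[str, int]):
        super().__init__()
        self.student = student
        self.teachers = nn.ModuleDict(teachers)
        for t in self.teachers.values():      # teachers stay frozen
            for p in t.parameters():
                p.requires_grad_(False)
        # one linear head per teacher, mapping student features to teacher space
        self.heads = nn.ModuleDict({
            name: nn.Linear(student_dim, dim) for name, dim in teacher_dims.items()
        })

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # assume every encoder returns a list of per-layer hidden states,
        # each of shape (batch, frames, dim)
        student_layers = self.student(wav)
        loss = 0.0
        for name, teacher in self.teachers.items():
            with torch.no_grad():
                teacher_layers = teacher(wav)
            # assume matched layer counts for simplicity; in practice a
            # student-to-teacher layer mapping would be needed
            for s, t in zip(student_layers, teacher_layers):
                pred = self.heads[name](s)
                # L1 distance plus (1 - cosine similarity), averaged over frames
                loss = loss + F.l1_loss(pred, t) \
                            + (1 - F.cosine_similarity(pred, t, dim=-1)).mean()
        return loss
```

Freezing the teachers keeps pretraining cost close to one extra forward pass per teacher, and in this pattern the lightweight per-teacher heads can be discarded after training, leaving only the single student encoder for downstream use.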