USAD: Universal Speech and Audio Representation via Distillation

📅 2025-06-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing self-supervised audio representation models are typically designed for a single domain (e.g., speech or music), which limits cross-domain generalization. To address this, we propose USAD (Universal Speech and Audio Distillation), a unified framework that integrates knowledge from domain-specific self-supervised models for speech, environmental sounds, and music via layer-wise knowledge distillation, yielding a single encoder usable across tasks. USAD is trained on a large-scale mixed-audio corpus and learns robust, general-purpose features. Evaluated on the SUPERB and HEAR benchmarks, it achieves near-state-of-the-art performance across diverse downstream tasks, including frame-level speech tasks (e.g., ASR), instance-level tasks (e.g., audio classification), audio tagging, and sound event detection, while improving cross-domain representation consistency and downstream generalization.
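The layer-wise distillation idea above can be sketched as follows: each matched student layer is linearly projected into the teacher's feature space and regressed onto the corresponding (frozen) teacher layer's output. This is a minimal illustrative sketch, not the paper's implementation; the function name, the L1 regression loss, the per-layer linear projections, and the one-to-one layer mapping are all assumptions.

```python
import numpy as np

def layer_to_layer_distill_loss(student_layers, teacher_layers, projections):
    """Toy layer-to-layer distillation loss (an assumption, not USAD's exact loss).

    student_layers: list of (T, d_s) arrays, outputs of matched student layers
    teacher_layers: list of (T, d_t) arrays from a frozen domain-specific teacher
    projections:    list of (d_s, d_t) arrays mapping student dim to teacher dim

    Returns the L1 regression loss between projected student features and
    teacher features, averaged over frames, dimensions, and layers.
    """
    losses = []
    for s, t, w in zip(student_layers, teacher_layers, projections):
        pred = s @ w                       # project student features to teacher dim
        losses.append(np.abs(pred - t).mean())
    return float(np.mean(losses))

# Toy example: 2 matched layers, 10 frames, student dim 4, teacher dim 6
rng = np.random.default_rng(0)
S = [rng.standard_normal((10, 4)) for _ in range(2)]
T = [rng.standard_normal((10, 6)) for _ in range(2)]
W = [rng.standard_normal((4, 6)) for _ in range(2)]
loss = layer_to_layer_distill_loss(S, T, W)
```

With several domain-specific teachers, one such loss per teacher would be summed, letting a single student encoder absorb speech, sound, and music representations at once.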

📝 Abstract
Self-supervised learning (SSL) has revolutionized audio representations, yet models often remain domain-specific, focusing on either speech or non-speech tasks. In this work, we present Universal Speech and Audio Distillation (USAD), a unified approach to audio representation learning that integrates diverse audio types (speech, sound, and music) into a single model. USAD employs efficient layer-to-layer distillation from domain-specific SSL models to train a student on a comprehensive audio dataset. USAD offers competitive performance across various benchmarks and datasets, including frame- and instance-level speech processing tasks, audio tagging, and sound classification, achieving near state-of-the-art results with a single encoder on SUPERB and HEAR benchmarks.
Problem

Research questions and friction points this paper is trying to address.

Unify speech and non-speech audio representation learning
Integrate diverse audio types into single model
Achieve competitive performance across multiple benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified audio representation via distillation
Layer-to-layer distillation from SSL models
Single encoder for diverse audio tasks