AI Summary
This work addresses the challenges of continual learning in pre-trained audio models, which are highly susceptible to time-varying data distributions. Existing parameter-efficient fine-tuning (PEFT) methods, designed primarily for vision tasks, often fail to achieve effective semantic alignment in audio because they overlook the audio backbone's reliance on low-level spectral details. To this end, we establish the first audio continual learning benchmark, revealing critical issues of representation saturation and drift. We propose PACE, a novel framework that integrates a regularized analytic classifier, adaptive subspace-orthogonal PEFT, and spectral boundary-aware perturbations to enhance both semantic alignment and learning stability. Extensive experiments across six diverse audio benchmarks demonstrate that PACE significantly outperforms current state-of-the-art methods, underscoring its robustness and scalability.
Abstract
Audio is a fundamental modality for analyzing speech, music, and environmental sounds. Although pretrained audio models have significantly advanced audio understanding, they remain fragile in real-world settings where data distributions shift over time. In this work, we present the first systematic benchmark for audio continual learning (CL) with pretrained models (PTMs), together with a comprehensive analysis of its unique challenges. Unlike in vision, where parameter-efficient fine-tuning (PEFT) has proven effective for CL, directly transferring such strategies to audio leads to poor performance. This stems from a fundamental property of audio backbones: they focus on low-level spectral details rather than structured semantics, causing severe upstream-downstream misalignment. Through extensive empirical study, we identify analytic classifiers with first-session adaptation (FSA) as a promising direction, but also reveal two major limitations: representation saturation in coarse-grained scenarios and representation drift in fine-grained scenarios. To address these challenges, we propose PACE, a novel method that enhances FSA via a regularized analytic classifier and enables multi-session adaptation through adaptive subspace-orthogonal PEFT for improved semantic alignment. In addition, we introduce spectrogram-based boundary-aware perturbations to mitigate representation overlap and improve stability. Experiments on six diverse audio CL benchmarks demonstrate that PACE substantially outperforms state-of-the-art baselines, marking an important step toward robust and scalable audio continual learning with PTMs.
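The "analytic classifier" idea mentioned above can be illustrated with a minimal sketch: a ridge-regression classifier solved in closed form over frozen backbone features, where each session contributes sufficient statistics (a Gram matrix and a cross term) so the weights are re-solved analytically, with no gradient steps and no rehearsal of old data. The class name, the regularizer `gamma`, and the update scheme below are illustrative assumptions, not PACE's actual implementation.

```python
import numpy as np

class AnalyticClassifier:
    """Closed-form ridge classifier over frozen features (illustrative sketch,
    not the paper's implementation)."""

    def __init__(self, feat_dim: int, gamma: float = 1.0):
        # Accumulated Gram matrix X^T X + gamma * I and cross term X^T Y.
        self.G = gamma * np.eye(feat_dim)
        self.C = None
        self.W = None

    def update(self, X: np.ndarray, Y: np.ndarray) -> None:
        """Absorb one session: X is (n, d) features, Y is (n, k) one-hot labels."""
        self.G += X.T @ X
        if self.C is None:
            self.C = X.T @ Y
        else:
            # Grow the cross term if this session introduced new classes.
            if Y.shape[1] > self.C.shape[1]:
                pad = np.zeros((self.C.shape[0], Y.shape[1] - self.C.shape[1]))
                self.C = np.hstack([self.C, pad])
            self.C += X.T @ Y
        # Closed-form ridge solution: W = (X^T X + gamma I)^{-1} X^T Y.
        self.W = np.linalg.solve(self.G, self.C)

    def predict(self, X: np.ndarray) -> np.ndarray:
        return np.argmax(X @ self.W, axis=1)
```

Because only the statistics `G` and `C` are stored, later sessions update the classifier without revisiting earlier sessions' features, which is what makes this style of classifier attractive for continual learning with a frozen or lightly adapted backbone.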