E-BATS: Efficient Backpropagation-Free Test-Time Adaptation for Speech Foundation Models

📅 2025-06-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Speech foundation models suffer significant performance degradation under real-world acoustic domain shifts (e.g., noise, accents). Existing test-time adaptation (TTA) methods either incur high GPU memory overhead due to backpropagation or achieve insufficient accuracy without it; moreover, most are designed for vision tasks and fail to accommodate speech-specific modeling characteristics. This work introduces the first backpropagation-free TTA framework tailored for speech tasks, featuring lightweight prompt alignment, a multi-scale contrastive loss that jointly models utterance- and token-level distribution shifts, and a test-time exponential moving average mechanism. Evaluated across 16 acoustic conditions and 4 noisy speech datasets, the method improves accuracy by 4.1-13.5% over backpropagation-free baselines and reduces GPU memory consumption by 2.0-6.4x compared to backpropagation-based approaches, enabling efficient online adaptation without source data or labels.

📝 Abstract
Speech Foundation Models encounter significant performance degradation when deployed in real-world scenarios involving acoustic domain shifts, such as background noise and speaker accents. Test-time adaptation (TTA) has recently emerged as a viable strategy to address such domain shifts at inference time without requiring access to source data or labels. However, existing TTA approaches, particularly those relying on backpropagation, are memory-intensive, limiting their applicability in speech tasks and resource-constrained settings. Although backpropagation-free methods offer improved efficiency, existing ones exhibit poor accuracy. This is because they are predominantly developed for vision tasks, which differ fundamentally from speech in task formulation, noise characteristics, and model architecture, posing unique transferability challenges. In this paper, we introduce E-BATS, the first Efficient BAckpropagation-free TTA framework designed explicitly for speech foundation models. E-BATS achieves a balance between adaptation effectiveness and memory efficiency through three key components: (i) lightweight prompt adaptation for forward-pass-based feature alignment, (ii) a multi-scale loss to capture both global (utterance-level) and local (token-level) distribution shifts, and (iii) a test-time exponential moving average mechanism for stable adaptation across utterances. Experiments conducted on four noisy speech datasets spanning sixteen acoustic conditions demonstrate consistent improvements, with 4.1%-13.5% accuracy gains over backpropagation-free baselines and 2.0-6.4 times GPU memory savings compared to backpropagation-based methods. By enabling scalable and robust adaptation under acoustic variability, this work paves the way for more efficient adaptation approaches for practical speech processing systems in real-world environments.
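To make component (ii) concrete, the sketch below shows one plausible form of a multi-scale alignment objective: a global term comparing the pooled utterance representation against a precomputed source-domain statistic, plus a local term averaged over tokens. The function name, the choice of mean-squared distance, and the source statistics (`src_utt_mean`, `src_tok_mean`) are illustrative assumptions, not the paper's exact formulation.

```python
def multi_scale_alignment_loss(feats, src_utt_mean, src_tok_mean, alpha=0.5):
    """Hypothetical multi-scale alignment loss (forward-pass only).

    feats: list of T token feature vectors, each of dimension D.
    src_utt_mean, src_tok_mean: length-D source-domain statistics
    (assumed precomputed offline; the paper's statistics may differ).
    alpha: weight balancing global vs. local terms.
    """
    T, D = len(feats), len(src_utt_mean)
    # Global (utterance-level) shift: pool tokens, compare to source mean.
    utt = [sum(f[d] for f in feats) / T for d in range(D)]
    global_loss = sum((utt[d] - src_utt_mean[d]) ** 2 for d in range(D)) / D
    # Local (token-level) shift: average per-token distance to source mean.
    local_loss = sum(
        (f[d] - src_tok_mean[d]) ** 2 for f in feats for d in range(D)
    ) / (T * D)
    return alpha * global_loss + (1 - alpha) * local_loss
```

Because the loss is computed from forward-pass features only, it can drive a lightweight, gradient-free prompt update rather than full backpropagation.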
Problem

Research questions and friction points this paper is trying to address.

Address performance drop in speech models from domain shifts
Reduce memory-intensive backpropagation in test-time adaptation
Improve accuracy of backpropagation-free methods for speech tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight prompt adaptation for feature alignment
Multi-scale loss for global and local shifts
Test-time exponential moving average mechanism
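The third component, the test-time exponential moving average, can be sketched as a simple blend of the previous prompt with the one adapted on the current utterance; `decay` and the flat-vector prompt representation are illustrative assumptions, not the paper's exact mechanism.

```python
def ema_update(old_prompt, new_prompt, decay=0.99):
    # Blend the freshly adapted prompt into a running average so that a
    # single atypical utterance cannot destabilize later adaptation steps.
    return [decay * o + (1 - decay) * n for o, n in zip(old_prompt, new_prompt)]
```

A high decay keeps the prompt close to its smoothed history, which is what stabilizes online adaptation as utterances stream in one at a time.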