Investigation on the Robustness of Acoustic Foundation Models on Post Exercise Speech

📅 2026-03-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the degradation in robustness of automatic speech recognition (ASR) systems caused by post-exercise physiological changes, such as disordered breathing, vocal instability, and non-semantic pauses, and presents the first systematic evaluation of mainstream acoustic foundation models under this challenging condition. The assessment covers both out-of-the-box and domain-finetuned settings across fluent and non-fluent speakers, spanning sequence-to-sequence models (e.g., Whisper, FunASR/Paraformer) and self-supervised encoders (e.g., Wav2Vec2, HuBERT, WavLM) paired with CTC decoding. Experimental results show that FunASR achieves the strongest off-the-shelf performance on the Post-All test set (WER 14.57%, CER 8.21%); CTC-based models benefit substantially from finetuning, whereas Whisper exhibits inconsistent gains. Recognition for non-fluent speakers is consistently harder than for fluent speakers, though that subset is small, highlighting the influence of model architecture, finetuning strategy, and speaker fluency on ASR performance in post-exercise scenarios.
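The "self-supervised encoders paired with CTC decoding" above refers to models whose per-frame label posteriors are collapsed into a transcript. A minimal greedy CTC decoding sketch (the token ids and blank index are illustrative, not taken from the paper):

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Collapse a per-frame best-path label sequence into output tokens:
    merge consecutive repeats, then drop blank symbols."""
    out, prev = [], None
    for t in frame_ids:
        if t != prev and t != blank:  # new non-blank label starts here
            out.append(t)
        prev = t
    return out

# Frames: blank, 1, 1, blank, 2, 2, 2, blank, 1
print(ctc_greedy_decode([0, 1, 1, 0, 2, 2, 2, 0, 1]))  # → [1, 2, 1]
```

Note how the blank between the trailing `2` and `1` lets CTC emit distinct adjacent tokens, while repeats within a run are merged; this is the standard best-path approximation, not the full beam-search decoding a production system might use.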
📝 Abstract
Automatic speech recognition (ASR) has been extensively studied on neutral and stationary speech, yet its robustness under post-exercise physiological shift remains underexplored. Compared with resting speech, post-exercise speech often contains micro-breaths, non-semantic pauses, unstable phonation, and repetitions caused by reduced breath support, making transcription more difficult. In this work, we benchmark acoustic foundation models on post-exercise speech under a unified evaluation protocol. We compare sequence-to-sequence models (Whisper and FunASR/Paraformer) and self-supervised encoders with CTC decoding (Wav2Vec2, HuBERT, and WavLM), under both off-the-shelf inference and post-exercise in-domain fine-tuning. Across the Static/Post-All benchmark, most models degrade on post-exercise speech, while FunASR shows the strongest baseline robustness at 14.57% WER and 8.21% CER on Post-All. Fine-tuning substantially improves several CTC-based models, whereas Whisper shows unstable adaptation. As an exploratory case study, we further stratify results by fluent and non-fluent speakers; although the non-fluent subset is small, it is consistently more challenging than the fluent subset. Overall, our findings show that post-exercise ASR robustness is strongly model-dependent, that in-domain adaptation can be highly effective but not uniformly stable, and that future post-exercise ASR studies should explicitly separate fluency-related effects from exercise-induced speech variation.
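The WER and CER figures quoted above are standard edit-distance metrics: word (or character) substitutions, insertions, and deletions divided by the reference length. A self-contained sketch (the example strings are illustrative, not drawn from the paper's benchmark):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences, single-row DP."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = d[j]
            d[j] = min(d[j] + 1,          # deletion from ref
                       d[j - 1] + 1,      # insertion into ref
                       prev + (r != h))   # substitution (0 cost if match)
            prev = cur
    return d[-1]

def wer(ref, hyp):
    """Word error rate: edit distance over word tokens / reference length."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)

def cer(ref, hyp):
    """Character error rate: edit distance over characters / reference length."""
    return edit_distance(list(ref), list(hyp)) / len(ref)

print(wer("the quick brown fox", "the quick brown dog"))  # → 0.25
```

Reported numbers such as 14.57% WER vs. 8.21% CER typically diverge like this because a single substituted word counts fully against WER while only its differing characters count against CER.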
Problem

Research questions and friction points this paper is trying to address.

post-exercise speech
automatic speech recognition
robustness
physiological shift
speech variation
Innovation

Methods, ideas, or system contributions that make the work stand out.

post-exercise speech
acoustic foundation models
robustness
in-domain fine-tuning
fluency stratification