🤖 AI Summary
This study addresses the fine-grained detection of semantic, respiratory, and hybrid (respiratory-semantic) pauses in post-exercise speech, along with exertion-level classification. We introduce the first multi-type pause annotation dataset for post-exercise speech and a hierarchical cascaded modeling framework. Methodologically, we integrate Wav2Vec2’s hierarchical representations with conventional acoustic features (MFCCs/MFBs), and design both single-model and two-stage cascaded architectures compatible with GRU, CNN-LSTM, AlexNet, and VGG16 for joint multi-task learning. Our key contribution lies in the first systematic annotation and joint recognition of all three pause types in post-exercise speech, enhanced by feature–task co-design to improve generalization. Experiments show pause detection accuracies of 89% (semantic), 55% (respiratory), 86% (hybrid), and 73% overall; exertion-level classification achieves 90.5% accuracy—substantially outperforming prior approaches.
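The feature-fusion idea described above (combining Wav2Vec2's layer-wise representations with conventional MFCCs) can be sketched as frame-wise concatenation before the sequence model. The snippet below is a minimal illustration with random placeholder arrays standing in for the real feature extractors; the dimensions (13-dim MFCC, 768-dim Wav2Vec2 hidden states) are typical defaults, not values stated in the paper.

```python
import numpy as np

# Hypothetical sketch of the feature-fusion setup: per-frame MFCCs are
# concatenated with one hidden-layer embedding from Wav2Vec2. The random
# arrays below are placeholders for the actual feature extractors.
rng = np.random.default_rng(0)

n_frames = 100
mfcc = rng.standard_normal((n_frames, 13))        # 13-dim MFCCs per frame
w2v_layer = rng.standard_normal((n_frames, 768))  # one Wav2Vec2 layer (768-dim)

# Fuse by frame-wise concatenation; a GRU or CNN-LSTM would consume this sequence.
fused = np.concatenate([mfcc, w2v_layer], axis=1)
print(fused.shape)  # (100, 781)
```

A sequence model such as a GRU would then take `fused` as its per-frame input, which is one plausible reading of the "feature fusion" setup the paper evaluates.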
📝 Abstract
Post-exercise speech contains rich physiological and linguistic cues, often marked by semantic pauses, breathing pauses, and combined breathing-semantic pauses. Detecting these events enables assessment of recovery rate, lung function, and exertion-related abnormalities. However, existing work on identifying and distinguishing different types of pauses in this context is limited. In this work, building on a recently released dataset with synchronized audio and respiration signals, we provide systematic annotations of pause types. Using these annotations, we conduct exploratory breathing and semantic pause detection and exertion-level classification across deep learning models (GRU, 1D CNN-LSTM, AlexNet, VGG16), acoustic features (MFCC, MFB), and layer-stratified Wav2Vec2 representations. We evaluate three setups (single feature, feature fusion, and a two-stage detection-classification cascade) under both classification and regression formulations. Results show per-type detection accuracy up to 89% for semantic, 55% for breathing, and 86% for combined pauses, with 73% overall, while exertion-level classification achieves 90.5% accuracy, outperforming prior work.
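The two-stage detection-classification cascade can be illustrated with toy control flow: a first stage decides whether a window contains any pause, and a second stage assigns detected pauses to one of the three annotated types. The detectors below are hypothetical placeholders (a simple energy threshold and a dummy type rule), not the paper's trained models.

```python
import numpy as np

# Toy sketch of the two-stage cascade; both stages use placeholder logic.
def stage1_detect_pause(window):
    """Placeholder detector: flags low-energy windows as pauses."""
    return window.mean() < 0.2

def stage2_classify_type(window):
    """Placeholder classifier over the three annotated pause types."""
    labels = ["semantic", "breathing", "combined"]
    return labels[int(window.sum()) % 3]

def cascade(window):
    # Stage 1: pause vs. speech; stage 2 runs only on detected pauses.
    if not stage1_detect_pause(window):
        return "speech"
    return stage2_classify_type(window)

window = np.full(10, 0.1)  # a quiet window
print(cascade(window))     # mean 0.1 < 0.2, so stage 2 fires -> "breathing"
```

In the paper's actual pipeline, both stages would be learned models (e.g., GRU or VGG16 heads over the fused features); the cascade structure lets the type classifier specialize on frames the detector has already flagged.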