🤖 AI Summary
This study addresses the challenge of modeling long-term rhythmic abnormalities in dementia-related speech. It applies Rhythm Formant Analysis (RFA) to construct AM/FM rhythm spectrograms that explicitly characterize slow temporal modulations in speech signals. On top of these, the authors design handcrafted features describing the morphology of the rhythm spectrograms and introduce a ViT-BERT multimodal fusion approach that jointly models the visual structure of the rhythm spectrograms and linguistic semantic information. In the reported experiments, the handcrafted features yield a 14.2% relative improvement in classification accuracy over the eGeMAPs baseline and perform comparably on the regression task. In the fusion setting, RFA spectrograms achieve a relative gain of around 13.1% in classification over conventional Mel spectrograms, with regression scores comparable to the baselines. The work offers a new feature representation and a multimodal modeling framework for non-invasive, speech-based dementia screening.
📝 Abstract
This study explores the potential of Rhythm Formant Analysis (RFA) to capture long-term temporal modulations in dementia speech. Specifically, we introduce RFA-derived rhythm spectrograms as novel features for dementia classification and regression tasks. We propose two methodologies: (1) handcrafted features derived from rhythm spectrograms, and (2) a data-driven fusion approach that feeds the proposed RFA-derived rhythm spectrograms to a vision transformer (ViT) for acoustic representations and combines them with BERT-based linguistic embeddings. We compare both with existing features. Notably, our handcrafted features outperform eGeMAPs with a relative improvement of $14.2\%$ in classification accuracy and achieve comparable performance in the regression task. The fusion approach also shows improvement: RFA spectrograms surpass Mel spectrograms in classification by a relative improvement of around $13.1\%$ and obtain a regression score comparable to the baselines.
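To make the idea of a rhythm spectrogram more concrete, the following is a minimal sketch of one way to expose slow temporal modulations: take the amplitude envelope of the waveform and compute a windowed spectrum of that envelope below about 10 Hz. This is not the authors' RFA implementation. The function names, the block-averaged envelope, the 3-second window, the 0.5-second hop, and the 10 Hz cutoff are all illustrative assumptions, and the input is a synthetic amplitude-modulated tone rather than speech.

```python
import numpy as np


def amplitude_envelope(signal, sr, env_sr=100):
    """Rectify the signal, then smooth and downsample it by block averaging.

    env_sr is the (assumed) envelope sampling rate in Hz.
    """
    hop = sr // env_sr
    n_blocks = len(signal) // hop
    blocks = np.abs(signal[: n_blocks * hop]).reshape(n_blocks, hop)
    return blocks.mean(axis=1)


def rhythm_spectrogram(signal, sr, env_sr=100, win_s=3.0, hop_s=0.5, max_hz=10.0):
    """Windowed magnitude spectrum of the amplitude envelope up to max_hz.

    Returns (freqs, spec) where spec has shape (n_freqs, n_frames).
    All parameter defaults are hypothetical choices for illustration.
    """
    env = amplitude_envelope(signal, sr, env_sr)
    win = int(win_s * env_sr)
    hop = int(hop_s * env_sr)
    window = np.hanning(win)
    freqs = np.fft.rfftfreq(win, d=1.0 / env_sr)
    keep = freqs <= max_hz

    frames = []
    for start in range(0, len(env) - win + 1, hop):
        seg = env[start : start + win]
        seg = seg - seg.mean()  # remove the DC offset of the envelope
        frames.append(np.abs(np.fft.rfft(seg * window))[keep])
    return freqs[keep], np.array(frames).T


# Synthetic example: a 200 Hz tone whose amplitude is modulated at 4 Hz.
sr = 16000
t = np.arange(10 * sr) / sr
signal = (1.0 + 0.8 * np.sin(2 * np.pi * 4.0 * t)) * np.sin(2 * np.pi * 200.0 * t)

freqs, spec = rhythm_spectrogram(signal, sr)
dominant_hz = freqs[spec.mean(axis=1).argmax()]
print(f"dominant modulation frequency: {dominant_hz:.2f} Hz")
```

For this synthetic input, the dominant modulation frequency is expected to sit close to the 4 Hz modulator. In a pipeline like the one the abstract describes, a matrix of this kind would be rendered as an image for the ViT branch or summarized into handcrafted descriptors, but those steps depend on details not given here.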