Audio-Vision Contrastive Learning for Phonological Class Recognition

📅 2025-07-23
đŸ€– AI Summary
Clinical speech analysis requires accurate automatic classification of articulatory–phonemic features—such as manner, place, and voicing—to advance understanding of speech production mechanisms and enable personalized speech rehabilitation. Method: We propose a contrastive learning–driven audiovisual multimodal deep learning framework that jointly models real-time magnetic resonance imaging (rtMRI) and synchronized acoustic signals to learn cross-modal consistent representations. Contribution/Results: Our approach significantly enhances discriminability along articulatory dimensions compared to unimodal baselines and conventional fusion methods. Evaluated on the USC-TIMIT dataset, it achieves a mean F1-score of 0.81—representing an absolute improvement of 0.23 over the best unimodal baseline—and establishes new state-of-the-art performance. This demonstrates the efficacy and robustness of contrastive-driven cross-modal representation learning for clinically relevant articulatory analysis.

📝 Abstract
Accurate classification of articulatory-phonological features plays a vital role in understanding human speech production and developing robust speech technologies, particularly in clinical contexts where targeted phonemic analysis and therapy can improve disease diagnosis accuracy and personalized rehabilitation. In this work, we propose a multimodal deep learning framework that combines real-time magnetic resonance imaging (rtMRI) and speech signals to classify three key articulatory dimensions: manner of articulation, place of articulation, and voicing. We perform classification on 15 phonological classes derived from the aforementioned articulatory dimensions and evaluate the system with four audio/vision configurations: unimodal rtMRI, unimodal audio signals, multimodal middle fusion, and contrastive learning-based audio-vision fusion. Experimental results on the USC-TIMIT dataset show that our contrastive learning-based approach achieves state-of-the-art performance, with an average F1-score of 0.81, representing an absolute increase of 0.23 over the unimodal baseline. The results confirm the effectiveness of contrastive representation learning for multimodal articulatory analysis. Our code and processed dataset will be made publicly available at https://github.com/DaE-plz/AC_Contrastive_Phonology to support future research.
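The page does not show the actual training objective, but the contrastive audio-vision fusion described in the abstract is commonly implemented as a symmetric InfoNCE-style loss over paired embeddings. The sketch below is an assumption for illustration, not the paper's code: `info_nce`, the temperature value, and the embedding shapes are all hypothetical.

```python
import numpy as np

def info_nce(audio, vision, temperature=0.07):
    """Symmetric InfoNCE-style contrastive loss (illustrative sketch).

    audio, vision: (B, D) arrays of paired embeddings; row i of each
    modality comes from the same rtMRI/speech frame (the positive pair).
    """
    # L2-normalize so the dot product is cosine similarity
    a = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    v = vision / np.linalg.norm(vision, axis=1, keepdims=True)
    logits = a @ v.T / temperature  # (B, B); matched pairs on the diagonal

    idx = np.arange(len(a))

    def xent(l):
        # cross-entropy with the diagonal as the target class
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # average the audio->vision and vision->audio directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

Minimizing this pulls the audio and rtMRI embeddings of the same frame together while pushing apart embeddings from different frames, which is the cross-modal consistency the abstract refers to.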
Problem

Research questions and friction points this paper is trying to address.

Classify articulatory-phonological features using multimodal deep learning
Improve speech technology with contrastive learning-based audio-vision fusion
Enhance clinical speech analysis and personalized rehabilitation accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal deep learning with rtMRI and audio
Contrastive learning for audio-vision fusion
Classifies 15 phonological classes spanning manner, place, and voicing
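For contrast with the contrastive objective, the "multimodal middle fusion" baseline mentioned in the abstract typically concatenates per-frame audio and vision features before a shared classifier head. The sketch below is a hypothetical minimal version; the paper's actual feature dimensions and head architecture are not given on this page.

```python
import numpy as np

def middle_fusion(audio_feat, vision_feat, w, b):
    """Middle-fusion baseline (illustrative): concatenate modality
    features, then apply a linear head over the 15 phonological classes.

    audio_feat: (B, Da), vision_feat: (B, Dv)
    w: (Da + Dv, 15), b: (15,)  -- stand-ins for the fusion classifier.
    """
    fused = np.concatenate([audio_feat, vision_feat], axis=-1)
    logits = fused @ w + b
    # softmax over the 15 phonological classes
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```

Middle fusion mixes the modalities only at the feature level, with no explicit pressure for the two streams to agree; the contrastive variant adds that cross-modal alignment, which is what the reported +0.23 F1 gain is attributed to.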
Daiqi Liu
Pattern Recognition Lab, Friedrich-Alexander-UniversitĂ€t Erlangen-NĂŒrnberg, Germany

TomĂĄs Arias-Vergara
Pattern Recognition Lab, Friedrich-Alexander-UniversitĂ€t Erlangen-NĂŒrnberg, Germany

Jana Hutter
UKER/FAU Erlangen // King's College London
Magnetic Resonance Imaging · Perinatal Imaging · Quantitative Imaging

Andreas Maier
Pattern Recognition Lab, Friedrich-Alexander-UniversitĂ€t Erlangen-NĂŒrnberg, Germany

Paula Andrea Pérez-Toro
Friedrich-Alexander-UniversitĂ€t Erlangen-NĂŒrnberg; Universidad de Antioquia
Machine Learning · Speech Analysis · Gait Analysis · Natural Language Processing · Deep Learning