🤖 AI Summary
This study addresses phoneme recognition from real-time magnetic resonance imaging (rtMRI) video sequences, aiming to extract compact, interpretable representations of the spatiotemporal dynamics of the articulators. The authors propose a multi-feature fusion framework that jointly leverages linguistically motivated regions of interest (ROIs) targeting the tongue, lips, and other key articulators, raw video frames, and optical flow to model articulatory motion. A lightweight temporal deep learning architecture captures sequential dependencies, and ablation studies validate the contribution of each feature stream and of fine-grained temporal modeling. Compared with single-feature baselines, the multi-feature approach improves both accuracy and interpretability, achieving a phoneme error rate (PER) as low as 0.34. The results demonstrate that fine-grained modeling of articulatory dynamics is critical for robust image-to-phoneme mapping in rtMRI-based speech recognition.
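The architecture is described only at a high level, and no code accompanies the summary. The sketch below is one plausible reading, assuming a PyTorch setup: per-frame CNN embeddings of the raw video are concatenated with scalar ROI features and fed through a lightweight bidirectional GRU that emits per-frame phoneme logits (e.g., for CTC training). All module choices, names, and dimensions here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MultiFeatureFusionModel(nn.Module):
    """Illustrative sketch: fuse ROI trajectories with raw-frame embeddings,
    then model temporal dependencies with a lightweight recurrent encoder."""

    def __init__(self, n_rois=6, frame_embed_dim=128, hidden_dim=256, n_phonemes=40):
        super().__init__()
        # Small CNN that embeds each raw video frame (e.g., a 1xHxW midsagittal rtMRI slice).
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, frame_embed_dim),
        )
        # Lightweight temporal encoder over the fused per-frame features.
        self.temporal = nn.GRU(frame_embed_dim + n_rois, hidden_dim,
                               batch_first=True, bidirectional=True)
        # Per-frame phoneme logits (+1 for a CTC blank label).
        self.classifier = nn.Linear(2 * hidden_dim, n_phonemes + 1)

    def forward(self, frames, roi_feats):
        # frames: (B, T, 1, H, W); roi_feats: (B, T, n_rois) scalar ROI descriptors.
        b, t = frames.shape[:2]
        frame_emb = self.frame_encoder(frames.flatten(0, 1)).view(b, t, -1)
        fused = torch.cat([frame_emb, roi_feats], dim=-1)  # simple late feature fusion
        seq, _ = self.temporal(fused)
        return self.classifier(seq)  # (B, T, n_phonemes + 1)

# Usage: logits = MultiFeatureFusionModel()(frames, rois)
# with frames of shape (B, T, 1, H, W) and rois of shape (B, T, 6).
```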
📝 Abstract
Real-time Magnetic Resonance Imaging (rtMRI) visualizes vocal tract action, offering a comprehensive window into speech articulation. However, its signals are high-dimensional and noisy, hindering interpretation. We investigate compact representations of spatiotemporal articulatory dynamics for phoneme recognition from midsagittal vocal tract rtMRI videos. We compare three feature types: (1) raw video, (2) optical flow, and (3) six linguistically relevant regions of interest (ROIs) that track articulator movements. We evaluate models trained independently on each representation, as well as multi-feature combinations. Results show that multi-feature models consistently outperform single-feature baselines, with the lowest phoneme error rate (PER) of 0.34 obtained by combining ROI and raw video features. Temporal-fidelity experiments demonstrate that recognition relies on fine-grained articulatory dynamics, while ROI ablation studies reveal strong contributions from the tongue and lips. Our findings highlight how rtMRI-derived features provide both accuracy and interpretability, and they establish strategies for leveraging articulatory data in speech processing.
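The abstract does not spell out how PER is computed; the standard definition is the Levenshtein edit distance between the hypothesized and reference phoneme sequences, normalized by reference length. A minimal sketch of that computation:

```python
def phoneme_error_rate(reference, hypothesis):
    """Levenshtein (edit) distance between phoneme sequences,
    normalized by reference length -- the standard PER definition."""
    n, m = len(reference), len(hypothesis)
    # dp[i][j] = edits needed to turn reference[:i] into hypothesis[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[n][m] / max(n, 1)

print(phoneme_error_rate(["k", "ae", "t", "s"], ["k", "ae", "t"]))  # 0.25
```

On this definition, the reported PER of 0.34 corresponds to roughly 34 edits (substitutions, insertions, or deletions) per 100 reference phonemes.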