Interpretable Modeling of Articulatory Temporal Dynamics from real-time MRI for Phoneme Recognition

📅 2025-09-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses phoneme recognition from real-time magnetic resonance imaging (rtMRI) video sequences, aiming to extract concise and interpretable spatiotemporal dynamics of articulatory organs. We propose a multimodal feature fusion framework that jointly leverages linguistically motivated regions of interest (ROIs)—specifically targeting the tongue, lips, and other key articulators—and raw video frames, augmented with optical flow to model articulatory motion. A lightweight temporal deep learning architecture is introduced to capture sequential dependencies, and ablation studies systematically validate the contribution of each component to temporal modeling. Compared to single-modality baselines, our approach significantly improves interpretability and cross-subject generalization, achieving a minimum phoneme error rate of 0.34. Results demonstrate that fine-grained modeling of articulatory dynamics is critical for robust speech-to-image mapping in rtMRI-based phoneme recognition.

Technology Category

Application Category

📝 Abstract
Real-time Magnetic Resonance Imaging (rtMRI) visualizes vocal tract action, offering a comprehensive window into speech articulation. However, its signals are high dimensional and noisy, hindering interpretation. We investigate compact representations of spatiotemporal articulatory dynamics for phoneme recognition from midsagittal vocal tract rtMRI videos. We compare three feature types: (1) raw video, (2) optical flow, and (3) six linguistically-relevant regions of interest (ROIs) for articulator movements. We evaluate models trained independently on each representation, as well as multi-feature combinations. Results show that multi-feature models consistently outperform single-feature baselines, with the lowest phoneme error rate (PER) of 0.34 obtained by combining ROI and raw video. Temporal fidelity experiments demonstrate a reliance on fine-grained articulatory dynamics, while ROI ablation studies reveal strong contributions from tongue and lips. Our findings highlight how rtMRI-derived features provide accuracy and interpretability, and establish strategies for leveraging articulatory data in speech processing.
Problem

Research questions and friction points this paper is trying to address.

Modeling articulatory dynamics from noisy rtMRI videos
Comparing feature types for phoneme recognition accuracy
Evaluating interpretable representations of vocal tract movements
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combining ROI and raw video features
Leveraging multi-feature models for recognition
Using interpretable articulatory dynamics representation
🔎 Similar Papers
No similar papers found.
J
Jay Park
Signal Analysis and Interpretation Lab, University of Southern California
Hong Nguyen
Hong Nguyen
PhD Student at University of Southern California
Video UnderstandingMultimodality ModelsHuman-centric AIBehavioural Models
Sean Foley
Sean Foley
Macquarie University
Applied FinanceDigital FinanceMarket MicrostructureCryptocurrenciesDeFi
J
Jihwan Lee
Signal Analysis and Interpretation Lab, University of Southern California
Y
Yoonjeong Lee
Signal Analysis and Interpretation Lab, University of Southern California
D
Dani Byrd
Department of Linguistics, University of Southern California
S
Shrikanth Narayanan
Signal Analysis and Interpretation Lab, University of Southern California