🤖 AI Summary
This study addresses phoneme recognition from real-time magnetic resonance imaging (rtMRI) video sequences, aiming to extract compact, interpretable representations of the spatiotemporal dynamics of the articulators. The authors propose a multi-feature fusion framework that jointly leverages linguistically motivated regions of interest (ROIs) targeting the tongue, lips, and other key articulators, raw video frames, and optical flow to model articulatory motion. A lightweight temporal deep learning architecture captures sequential dependencies, and ablation studies validate the contribution of each feature stream and of fine-grained temporal modeling. Compared with single-feature baselines, the multi-feature approach improves both accuracy and interpretability, achieving a phoneme error rate (PER) as low as 0.34. The results demonstrate that fine-grained modeling of articulatory dynamics is critical for robust image-to-phoneme mapping in rtMRI-based speech recognition.
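The architecture is described only at a high level, and no code accompanies the summary. The sketch below is one plausible reading, assuming a PyTorch setup: per-frame CNN embeddings of the raw video are concatenated with scalar ROI features and fed through a lightweight bidirectional GRU that emits per-frame phoneme logits (e.g., for CTC training). All module choices, names, and dimensions here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MultiFeatureFusionModel(nn.Module):
    """Illustrative sketch: fuse ROI trajectories with raw-frame embeddings,
    then model temporal dependencies with a lightweight recurrent encoder."""

    def __init__(self, n_rois=6, frame_embed_dim=128, hidden_dim=256, n_phonemes=40):
        super().__init__()
        # Small CNN that embeds each raw video frame (e.g., a 1xHxW midsagittal rtMRI slice).
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, frame_embed_dim),
        )
        # Lightweight temporal encoder over the fused per-frame features.
        self.temporal = nn.GRU(frame_embed_dim + n_rois, hidden_dim,
                               batch_first=True, bidirectional=True)
        # Per-frame phoneme logits (+1 for a CTC blank label).
        self.classifier = nn.Linear(2 * hidden_dim, n_phonemes + 1)

    def forward(self, frames, roi_feats):
        # frames: (B, T, 1, H, W); roi_feats: (B, T, n_rois) scalar ROI descriptors.
        b, t = frames.shape[:2]
        frame_emb = self.frame_encoder(frames.flatten(0, 1)).view(b, t, -1)
        fused = torch.cat([frame_emb, roi_feats], dim=-1)  # simple late feature fusion
        seq, _ = self.temporal(fused)
        return self.classifier(seq)  # (B, T, n_phonemes + 1)

# Usage: logits = MultiFeatureFusionModel()(frames, rois)
# with frames of shape (B, T, 1, H, W) and rois of shape (B, T, 6).
```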
📝 Abstract
Real-time Magnetic Resonance Imaging (rtMRI) visualizes vocal tract action, offering a comprehensive window into speech articulation. However, its signals are high-dimensional and noisy, hindering interpretation. We investigate compact representations of spatiotemporal articulatory dynamics for phoneme recognition from midsagittal vocal tract rtMRI videos. We compare three feature types: (1) raw video, (2) optical flow, and (3) six linguistically relevant regions of interest (ROIs) that track articulator movements. We evaluate models trained independently on each representation, as well as multi-feature combinations. Results show that multi-feature models consistently outperform single-feature baselines, with the lowest phoneme error rate (PER) of 0.34 obtained by combining ROI and raw video features. Temporal-fidelity experiments demonstrate that recognition relies on fine-grained articulatory dynamics, while ROI ablation studies reveal strong contributions from the tongue and lips. Our findings highlight how rtMRI-derived features provide both accuracy and interpretability, and they establish strategies for leveraging articulatory data in speech processing.
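The abstract does not spell out how PER is computed; the standard definition is the Levenshtein edit distance between the hypothesized and reference phoneme sequences, normalized by reference length. A minimal sketch of that computation:

```python
def phoneme_error_rate(reference, hypothesis):
    """Levenshtein (edit) distance between phoneme sequences,
    normalized by reference length -- the standard PER definition."""
    n, m = len(reference), len(hypothesis)
    # dp[i][j] = edits needed to turn reference[:i] into hypothesis[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[n][m] / max(n, 1)

print(phoneme_error_rate(["k", "ae", "t", "s"], ["k", "ae", "t"]))  # 0.25
```

On this definition, the reported PER of 0.34 corresponds to roughly 34 edits (substitutions, insertions, or deletions) per 100 reference phonemes.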