Speech-Guided Multimodal Learning for Vocal Tract Segmentation in Real-Time MRI

📅 2026-05-18
📈 Citations: 0
✹ Influential: 0
📄 PDF

đŸ€– AI Summary
This work addresses the challenge of accurately segmenting articulatory organs in real-time MRI (rtMRI), which is hindered by low image contrast, rapid motion, and limited spatial resolution. The authors propose a three-stage framework that uses both speech and phonological supervision during training but requires only rtMRI images at inference. The three stages are: converting phonological representations into spatial bounding-box priors for articulator localization, aligning the visual and acoustic encoders through dual-level cross-modal contrastive pretraining, and fusing the learned representations with a cross-attention decoder. Together these transfer knowledge from multimodal training to unimodal inference. Evaluated on the 75-Speaker Annot-16 and USC-TIMIT datasets, the method outperforms existing unimodal and multimodal approaches, supporting precise, clinically deployable vocal tract segmentation without any audio input at test time.
📝 Abstract
Segmenting vocal tract articulators in real-time MRI (rtMRI) is a challenging dynamic image segmentation problem characterized by low contrast, rapid motion, and limited spatial resolution. Although rtMRI acquisitions may provide synchronized acoustic signals, existing methods discard this information, and the few multimodal approaches that incorporate audio cannot be deployed when audio is unavailable. We propose a three-stage framework that leverages acoustic and phonological supervision during training while requiring only the rtMRI image at inference: phonological representations are converted into spatial bounding-box priors for articulator localization, visual and acoustic encoders are aligned via dual-level cross-modal contrastive pretraining, and the learned representations are fused through a cross-attention decoder, effectively transferring multimodal knowledge into a single-modality inference pipeline. Evaluated on the 75-Speaker Annot-16 and USC-TIMIT datasets, our method outperforms existing unimodal and multimodal methods, demonstrating that multimodal supervision provides transferable benefits for precise and clinically deployable vocal tract segmentation.
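To make the contrastive alignment step more concrete, here is a minimal sketch of a symmetric InfoNCE loss between paired image and audio embeddings. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the loss, the temperature, or what the two levels of the "dual-level" strategy are, so this shows a single level only. The function name `symmetric_info_nce`, the temperature of 0.07, and the toy embeddings are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_info_nce(image_emb, audio_emb, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of paired embeddings.

    image_emb[i] and audio_emb[i] are assumed to come from the same
    rtMRI frame / audio segment pair (the positive); every other
    combination in the batch is treated as a negative.
    The temperature value is an assumption, not taken from the paper.
    """
    if not image_emb or len(image_emb) != len(audio_emb):
        raise ValueError("need equally sized, non-empty batches")
    img = [l2_normalize(v) for v in image_emb]
    aud = [l2_normalize(v) for v in audio_emb]
    n = len(img)
    # Cosine-similarity logits: rows index images, columns index audio.
    logits = [
        [sum(a * b for a, b in zip(img[i], aud[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]
    # Image-to-audio direction: each row should pick its own column.
    loss_i2a = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Audio-to-image direction: each column should pick its own row.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_a2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (loss_i2a + loss_a2i)


# Toy batch: matched pairs should give a lower loss than swapped pairs.
aligned = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]],
                             [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]],
                             [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing a loss of this kind pulls each image embedding toward its paired audio embedding and away from the other items in the batch, which is one common way to let an image-only encoder absorb acoustic structure before the audio branch is dropped at inference.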
Problem

Research questions and friction points this paper is trying to address.

vocal tract segmentation
real-time MRI
multimodal learning
dynamic image segmentation
speech-guided
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal learning
cross-modal contrastive pretraining
cross-attention decoder
phonological priors
real-time MRI segmentation
🔎 Similar Papers
No similar papers found.
Daiqi Liu
Pattern Recognition Lab, Friedrich-Alexander-UniversitĂ€t Erlangen-NĂŒrnberg, Erlangen, Germany
Lukas Mulzer
Pattern Recognition Lab, Friedrich-Alexander-UniversitĂ€t Erlangen-NĂŒrnberg, Erlangen, Germany
Md Hasan
Pattern Recognition Lab, Friedrich-Alexander-UniversitĂ€t Erlangen-NĂŒrnberg, Erlangen, Germany
Nyvenn de Castro
Smart Imaging Lab, Friedrich-Alexander-UniversitĂ€t Erlangen-NĂŒrnberg, Erlangen, Germany
Fangxu Xing
Harvard Medical School, Massachusetts General Hospital
Image Analysis, Artificial Intelligence, Deep Learning, Machine Learning, Computer Vision
Xingjian Kang
Center for AI and Data Science, Julius-Maximilians-UniversitĂ€t WĂŒrzburg, WĂŒrzburg, Germany
Chengze Ye
Pattern Recognition Lab, Friedrich-Alexander-UniversitĂ€t Erlangen-NĂŒrnberg, Erlangen, Germany
Siyuan Mei
Friedrich-Alexander-UniversitĂ€t Erlangen-NĂŒrnberg
medical image processing, foundation models, diffusion models
Yipeng Sun
Friedrich-Alexander-UniversitĂ€t Erlangen-NĂŒrnberg
Deep Learning, Image Processing, Inverse Problem
TomĂĄs Arias-Vergara
Pattern Recognition Lab, Friedrich-Alexander-UniversitĂ€t Erlangen-NĂŒrnberg, Erlangen, Germany; GITA Lab, Facultad de IngenierĂ­a, Universidad de Antioquia UdeA, MedellĂ­n, Colombia
Jana Hutter
UKER/FAU Erlangen // King's College London
Magnetic Resonance Imaging, Perinatal Imaging, Quantitative Imaging
Jonghye Woo
Associate Professor of Radiology, Harvard Medical School | MGH
Medical Image Analysis, Medical Imaging, Computer Vision, Machine Learning, Speech
Andreas Maier
Pattern Recognition Lab, Friedrich-Alexander-UniversitĂ€t Erlangen-NĂŒrnberg, Erlangen, Germany
Paula Andrea PĂ©rez-Toro
Friedrich-Alexander-UniversitĂ€t Erlangen-NĂŒrnberg; Universidad de Antioquia
Machine Learning, Speech Analysis, Gait Analysis, Natural Language Processing, Deep Learning