Speech Audio Generation from dynamic MRI via a Knowledge Enhanced Conditional Variational Autoencoder

📅 2025-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Dynamic MRI-based speech acquisition suffers from data incompleteness, noise corruption, and audio degradation caused by environmental interference in the MRI scanner. Method: We propose a speech generation framework built on knowledge enhancement and variational inference, with three components: (1) a Knowledge-Enhanced Conditional Variational Autoencoder (KE-CVAE) that jointly incorporates anatomical priors and acoustic constraints; (2) unsupervised feature pretraining on unlabeled MRI data to strengthen prior modeling; and (3) dynamic temporal alignment coupled with MRI-specific acoustic modeling. Results: Evaluated on an open-source dynamic vocal tract MRI speech dataset, the method significantly outperforms existing deep learning approaches. Generated speech exhibits high naturalness and robustly mitigates MRI-induced acoustic distortions and data incompleteness. This work establishes a novel paradigm for clinical speech reconstruction and neuroimaging-based speech decoding.
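The conditional-VAE backbone the summary describes can be sketched in miniature: an encoder maps audio features plus MRI-frame conditioning to a latent Gaussian, the reparameterization trick draws a sample, and a decoder reconstructs audio features from the latent code and the same MRI conditioning. This is a pure-NumPy illustration with toy dimensions and random weights, not the authors' KE-CVAE implementation; all names and sizes here are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions, not taken from the paper): MRI frame
# features condition both networks; audio features are the target.
MRI_DIM, AUDIO_DIM, HID, LATENT = 64, 80, 32, 16

def dense(x, w, b):
    return x @ w + b

# Randomly initialised weights stand in for trained parameters.
params = {
    "enc_w": rng.normal(0, 0.1, (AUDIO_DIM + MRI_DIM, HID)),
    "enc_b": np.zeros(HID),
    "mu_w": rng.normal(0, 0.1, (HID, LATENT)),
    "mu_b": np.zeros(LATENT),
    "lv_w": rng.normal(0, 0.1, (HID, LATENT)),
    "lv_b": np.zeros(LATENT),
    "dec_w": rng.normal(0, 0.1, (LATENT + MRI_DIM, AUDIO_DIM)),
    "dec_b": np.zeros(AUDIO_DIM),
}

def encode(audio, mri, p):
    # Posterior q(z | audio, mri): conditioning enters by concatenation.
    h = np.tanh(dense(np.concatenate([audio, mri], axis=-1), p["enc_w"], p["enc_b"]))
    return dense(h, p["mu_w"], p["mu_b"]), dense(h, p["lv_w"], p["lv_b"])

def reparameterize(mu, logvar):
    # z = mu + sigma * eps keeps sampling differentiable in a real framework.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def decode(z, mri, p):
    # Decoder p(audio | z, mri) is conditioned on the same MRI features.
    return dense(np.concatenate([z, mri], axis=-1), p["dec_w"], p["dec_b"])

def elbo_terms(audio, mri, p):
    mu, logvar = encode(audio, mri, p)
    z = reparameterize(mu, logvar)
    recon = decode(z, mri, p)
    recon_loss = np.mean((recon - audio) ** 2)          # reconstruction term
    kl = -0.5 * np.mean(1 + logvar - mu**2 - np.exp(logvar))  # KL to N(0, I)
    return recon, recon_loss, kl

batch = 4
audio = rng.normal(size=(batch, AUDIO_DIM))
mri = rng.normal(size=(batch, MRI_DIM))
recon, recon_loss, kl = elbo_terms(audio, mri, params)
```

Training would minimise `recon_loss + kl` (the negative ELBO); the paper's knowledge-enhancement stage would additionally pretrain the MRI feature extractor on unlabeled scans before this objective is optimised.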

📝 Abstract
Dynamic Magnetic Resonance Imaging (MRI) of the vocal tract has become an increasingly adopted imaging modality for speech motor studies. Beyond image signals, systematic data loss, noise pollution, and audio file corruption can occur due to the unpredictability of the MRI acquisition environment. In such cases, generating audio from images is critical for data recovery in both clinical and research applications. However, this remains challenging due to hardware constraints, acoustic interference, and data corruption. Existing solutions, such as denoising and multi-stage synthesis methods, face limitations in audio fidelity and generalizability. To address these challenges, we propose a Knowledge Enhanced Conditional Variational Autoencoder (KE-CVAE), a novel two-step "knowledge enhancement + variational inference" framework for generating speech audio signals from cine dynamic MRI sequences. This approach introduces two key innovations: (1) integration of unlabeled MRI data for knowledge enhancement, and (2) a variational inference architecture to improve generative modeling capacity. To the best of our knowledge, this is one of the first attempts at synthesizing speech audio directly from dynamic MRI video sequences. The proposed method was trained and evaluated on an open-source dynamic vocal tract MRI dataset recorded during speech. Experimental results demonstrate its effectiveness in generating natural speech waveforms while addressing MRI-specific acoustic challenges, outperforming conventional deep learning-based synthesis approaches.
Problem

Research questions and friction points this paper is trying to address.

Recovering speech audio from dynamic MRI sequences when recordings are lost or corrupted
Data loss, noise pollution, and audio degradation inherent to the MRI acquisition environment
Limited audio fidelity and generalizability of existing denoising and multi-stage synthesis methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Knowledge Enhanced Conditional Variational Autoencoder
Unlabeled MRI data integration
Variational inference for generative modeling