Tracking Articulatory Dynamics in Speech with a Fixed-Weight BiLSTM-CNN Architecture

📅 2025-04-25

🤖 AI Summary
This study addresses acoustic-to-articulatory inversion: estimating tongue and lip movements from the speech signal. We propose a stacked BiLSTM network followed by a 1D CNN post-processing stage, with fixed-weight initialization, trained on two corpora of simultaneously recorded speech and electromagnetic articulography (EMA) data. The bidirectional LSTMs capture temporal context in both directions, and the 1D CNN post-processes their output. The model is evaluated in four modes: speaker-dependent (SD), speaker-independent (SI), corpus-dependent (CD), and cross-corpus (CC). The main reported finding is that fixed-weight initialization outperforms adaptive-weight initialization within a relatively small number of training epochs. The work is presented as a step toward robust and efficient articulatory feature prediction for speech production research and applications.

📝 Abstract
Speech production is a complex sequential process that involves the coordination of various articulatory features. Among the articulators, the tongue is a highly versatile active articulator, responsible for shaping airflow to produce targeted speech sounds that are intelligible, clear, and distinct. This paper presents a novel approach for predicting the tongue and lip articulatory features underlying given speech acoustics, using a stacked Bidirectional Long Short-Term Memory (BiLSTM) architecture combined with a one-dimensional Convolutional Neural Network (CNN) for post-processing, with fixed-weight initialization. The proposed network is trained on two datasets of simultaneously recorded speech and Electromagnetic Articulography (EMA) data, which differ in geographical origin, linguistic characteristics, phonetic diversity, and recording equipment. The performance of the model is assessed in Speaker Dependent (SD), Speaker Independent (SI), Corpus Dependent (CD), and Cross Corpus (CC) modes. Experimental results indicate that the proposed model with fixed-weight initialization outperformed adaptive-weight initialization within a relatively small number of training epochs. These findings contribute to the development of robust and efficient models for articulatory feature prediction, paving the way for advancements in speech production research and applications.
Problem

Research questions and friction points this paper is trying to address.

Predicting tongue and lip articulatory features from speech acoustics
Using fixed-weight BiLSTM-CNN for robust articulatory dynamics tracking
Evaluating model performance across diverse datasets and speaker modes
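
The four evaluation modes can be pictured as four ways of partitioning the same pool of utterances. The sketch below is illustrative only: the `Utterance` record, the corpus and speaker labels, the hold-out rule, and the exact mode definitions (in particular, that SI stays within the held-out speaker's corpus) are assumptions, since the abstract names the modes without defining them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    corpus: str   # hypothetical corpus label, e.g. "A" or "B"
    speaker: str  # hypothetical speaker label, e.g. "A1"
    utt_id: int


def split(utterances, mode, speaker=None, corpus=None, holdout_every=4):
    """Return (train, test) for one evaluation mode (assumed definitions)."""
    if mode == "SD":
        # Speaker dependent: one speaker, disjoint utterances for train and test.
        pool = [u for u in utterances if u.speaker == speaker]
    elif mode == "CD":
        # Corpus dependent: one corpus, speakers pooled, disjoint utterances.
        pool = [u for u in utterances if u.corpus == corpus]
    elif mode == "SI":
        # Speaker independent: test on an unseen speaker, train on the other
        # speakers of the same corpus.
        test = [u for u in utterances if u.speaker == speaker]
        corpora = {u.corpus for u in test}
        train = [u for u in utterances
                 if u.corpus in corpora and u.speaker != speaker]
        return train, test
    elif mode == "CC":
        # Cross corpus: test on one corpus, train on everything else.
        test = [u for u in utterances if u.corpus == corpus]
        train = [u for u in utterances if u.corpus != corpus]
        return train, test
    else:
        raise ValueError(f"unknown mode: {mode}")
    # SD and CD: hold out every k-th utterance of the pool for testing.
    test = pool[::holdout_every]
    train = [u for i, u in enumerate(pool) if i % holdout_every != 0]
    return train, test


# Toy pool: two corpora, two speakers each, eight utterances per speaker.
data = [Utterance(c, f"{c}{s}", i) for c in "AB" for s in (1, 2) for i in range(8)]

sd_train, sd_test = split(data, "SD", speaker="A1")
si_train, si_test = split(data, "SI", speaker="A1")
cd_train, cd_test = split(data, "CD", corpus="A")
cc_train, cc_test = split(data, "CC", corpus="B")
```

The point of the sketch is the leakage boundary in each mode: SD and CD share speakers between train and test, SI never does, and CC shares neither speakers nor recording conditions.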
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fixed-weight BiLSTM-CNN for articulatory tracking
Combines BiLSTM and 1D CNN post-processing
Trained with multi-source EMA-speech datasets
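
To make the 1D CNN post-processing idea concrete, here is a minimal pure-Python sketch of a single-channel 1D convolution applied to a predicted articulator trajectory. It is not the paper's network: the trajectory values and the moving-average kernel are hypothetical, the edge-replication padding is a choice made here, and in the paper the CNN weights are learned from a fixed initialization whose details the abstract does not give.

```python
def conv1d_same(signal, kernel):
    """Single-channel 1D convolution in the CNN sense (cross-correlation).

    Output has the same length as the input; edges are padded by repeating
    the first and last samples. The kernel length is assumed to be odd.
    """
    half = len(kernel) // 2
    padded = [signal[0]] * half + list(signal) + [signal[-1]] * half
    return [
        sum(w * padded[t + k] for k, w in enumerate(kernel))
        for t in range(len(signal))
    ]


# Hypothetical per-frame prediction of one EMA coordinate, with frame jitter.
raw = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]

# Hand-picked moving-average kernel (an assumption, not learned weights).
kernel = [1 / 3, 1 / 3, 1 / 3]

smooth = conv1d_same(raw, kernel)
```

A convolution of this kind mixes each frame with its neighbours, which is why a 1D CNN placed after the BiLSTM stack can reduce frame-to-frame jitter in the predicted trajectories; a trained layer would use many channels and learned kernels instead of one fixed average.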