Tracking Articulatory Dynamics in Speech with a Fixed-Weight BiLSTM-CNN Architecture

📅 2025-04-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses acoustic-to-articulatory inversion: estimating tongue and lip movements from the acoustic speech signal. It proposes a stacked BiLSTM architecture followed by a 1D CNN post-processing stage with fixed-weight initialization, trained on two corpora of simultaneously recorded speech and electromagnetic articulography (EMA) data. The bidirectional LSTMs capture temporal dynamics, and the 1D CNN post-processes the predicted articulatory trajectories. Evaluation covers four modes: speaker-dependent (SD), speaker-independent (SI), corpus-dependent (CD), and cross-corpus (CC). The key contribution is the fixed-weight initialization strategy, which the authors report outperforms adaptive-weight initialization within a relatively small number of training epochs. The work aims to support robust and efficient articulatory feature prediction for speech production research and applications.

📝 Abstract

Speech production is a complex sequential process that involves the coordination of various articulatory features. Among the articulators, the tongue is a highly versatile active articulator responsible for shaping airflow to produce targeted speech sounds that are intelligible, clear, and distinct. This paper presents a novel approach for predicting the tongue and lip articulatory features involved in given speech acoustics using a stacked Bidirectional Long Short-Term Memory (BiLSTM) architecture, combined with a one-dimensional Convolutional Neural Network (CNN) for post-processing with fixed-weight initialization. The proposed network is trained on two datasets of simultaneously recorded speech and Electromagnetic Articulography (EMA) data, which differ in geographical origin, linguistic characteristics, phonetic diversity, and recording equipment. The performance of the model is assessed in Speaker Dependent (SD), Speaker Independent (SI), Corpus Dependent (CD), and Cross Corpus (CC) modes. Experimental results indicate that the proposed model with the fixed-weight approach outperformed adaptive-weight initialization within a relatively small number of training epochs. These findings contribute to the development of robust and efficient models for articulatory feature prediction, paving the way for advancements in speech production research and applications.

Problem

Research questions and friction points this paper is trying to address.

Predicting tongue and lip articulatory features from speech acoustics

Using a fixed-weight BiLSTM-CNN for robust tracking of articulatory dynamics

Evaluating model performance across diverse datasets and speaker modes
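
The four evaluation modes named above can be illustrated with a small data-partitioning sketch. The page does not define the modes precisely, so the definitions here are assumptions: SD trains and tests on disjoint utterances of one speaker, SI tests on a speaker absent from training, CD trains and tests within one corpus, and CC trains on one corpus and tests on the other. The `Utterance` record, the `tail_split` and `split` helpers, the 20% hold-out, and the corpus and speaker labels are all hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    corpus: str    # hypothetical corpus label
    speaker: str   # hypothetical speaker id
    index: int     # utterance number within the speaker


def tail_split(pool, holdout=0.2):
    """Hold out the last `holdout` fraction of each speaker's utterances."""
    by_speaker = {}
    for u in pool:
        by_speaker.setdefault(u.speaker, []).append(u)
    train, test = [], []
    for items in by_speaker.values():
        items.sort(key=lambda u: u.index)
        n_test = max(1, round(len(items) * holdout))
        train += items[:-n_test]
        test += items[-n_test:]
    return train, test


def split(utts, mode, *, speaker=None, corpus=None):
    """Return (train, test) for one evaluation mode (assumed definitions)."""
    if mode == "SD":  # one speaker, disjoint utterances
        return tail_split([u for u in utts if u.speaker == speaker])
    if mode == "SI":  # the test speaker is absent from training
        return ([u for u in utts if u.speaker != speaker],
                [u for u in utts if u.speaker == speaker])
    if mode == "CD":  # train and test inside one corpus
        return tail_split([u for u in utts if u.corpus == corpus])
    if mode == "CC":  # train on one corpus, test on the other
        return ([u for u in utts if u.corpus == corpus],
                [u for u in utts if u.corpus != corpus])
    raise ValueError(f"unknown mode: {mode}")


# Toy inventory: two corpora, two speakers each, five utterances per speaker.
utts = [Utterance(c, f"{c}{s}", i)
        for c in ("A", "B") for s in (1, 2) for i in range(5)]

for mode, kwargs in [("SD", {"speaker": "A1"}), ("SI", {"speaker": "A1"}),
                     ("CD", {"corpus": "A"}), ("CC", {"corpus": "A"})]:
    train, test = split(utts, mode, **kwargs)
    print(mode, len(train), len(test))
```

The point of the sketch is only that the four modes differ in which speakers and corpora are allowed to appear on both sides of the train/test boundary; the paper's actual partitioning protocol may differ.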

Innovation

Methods, ideas, or system contributions that make the work stand out.

Fixed-weight BiLSTM-CNN for articulatory tracking

Combines BiLSTM and 1D CNN post-processing

Trained with multi-source EMA-speech datasets
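
As a rough illustration of 1D convolutional post-processing with weights that are set deterministically rather than drawn at random, the sketch below applies a fixed kernel to a predicted articulator trajectory. The page does not state which weights are fixed or what values they take, so the uniform averaging kernel, the edge-replication padding, and the helper names `fixed_kernel` and `conv1d_same` are assumptions made for illustration only, not the authors' method.

```python
def fixed_kernel(width):
    """Uniform averaging weights (an assumed choice); identical on every run."""
    if width < 1 or width % 2 == 0:
        raise ValueError("width must be a positive odd integer")
    return [1.0 / width] * width


def conv1d_same(signal, kernel):
    """Slide the kernel over the signal, keeping the output length unchanged.

    Edges are padded by repeating the first and last samples, so a constant
    trajectory stays constant after filtering.
    """
    half = len(kernel) // 2
    padded = [signal[0]] * half + list(signal) + [signal[-1]] * half
    return [
        sum(w * padded[t + k] for k, w in enumerate(kernel))
        for t in range(len(signal))
    ]


# Hypothetical predicted trajectory for one articulator coordinate,
# containing a single-frame spike that the filter spreads out.
trajectory = [0.0, 0.0, 3.0, 0.0, 0.0]
smoothed = conv1d_same(trajectory, fixed_kernel(3))
print(smoothed)
```

In a trainable network the same idea would correspond to initializing a 1D convolution layer from such a kernel; whether the paper then freezes or fine-tunes those weights is not stated on this page.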
