🤖 AI Summary
This study addresses acoustic-to-articulatory inversion: estimating tongue and lip movements from the acoustic speech signal. We propose a stacked BiLSTM architecture followed by a 1D CNN post-processing stage, with fixed-weight initialization, trained on two electromagnetic articulography (EMA) datasets that differ in geographical origin, linguistic characteristics, phonetic diversity, and recording equipment. The bidirectional LSTMs capture temporal dynamics, and the 1D CNN refines the predicted articulatory features. The model is evaluated in four modes: speaker-dependent (SD), speaker-independent (SI), corpus-dependent (CD), and cross-corpus (CC). The key contribution is the fixed-weight initialization strategy, which outperforms adaptive-weight initialization within a relatively small number of training epochs. These results support the development of robust and efficient models for articulatory feature prediction, with applications in speech production research.
📝 Abstract
Speech production is a complex sequential process that involves the coordination of various articulatory features. Among the articulators, the tongue is a highly versatile active articulator responsible for shaping airflow to produce target speech sounds that are intelligible, clear, and distinct. This paper presents a novel approach for predicting tongue and lip articulatory features from speech acoustics using a stacked Bidirectional Long Short-Term Memory (BiLSTM) architecture, combined with a one-dimensional Convolutional Neural Network (CNN) for post-processing, with fixed-weight initialization. The proposed network is trained on two datasets of simultaneously recorded speech and Electromagnetic Articulography (EMA) data, which differ in geographical origin, linguistic characteristics, phonetic diversity, and recording equipment. The performance of the model is assessed in Speaker Dependent (SD), Speaker Independent (SI), Corpus Dependent (CD), and Cross Corpus (CC) modes. Experimental results indicate that the proposed model with the fixed-weight approach outperforms adaptive-weight initialization within a relatively small number of training epochs. These findings contribute to the development of robust and efficient models for articulatory feature prediction, paving the way for advancements in speech production research and applications.
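
The four evaluation modes differ only in how training and test utterances are partitioned. The following is a minimal sketch of that partitioning, not the authors' protocol: the `Utterance` record, the `split_for_mode` helper, the every-fifth-utterance holdout, and the toy corpus and speaker names are all hypothetical. In particular, it assumes that CD means training and testing within one corpus with speakers pooled, and that SI holds out one speaker within a corpus; the abstract does not define these details.

```python
from collections import namedtuple

# Hypothetical record: one utterance tagged with its corpus and speaker.
Utterance = namedtuple("Utterance", ["corpus", "speaker", "utt_id"])


def _holdout(pool, k):
    """Send every k-th utterance to the test set, the rest to training."""
    train = [u for i, u in enumerate(pool) if (i + 1) % k != 0]
    test = [u for i, u in enumerate(pool) if (i + 1) % k == 0]
    return train, test


def split_for_mode(utterances, mode, corpus, speaker=None, holdout_every=5):
    """Return (train, test) lists for one evaluation mode (assumed definitions)."""
    in_corpus = [u for u in utterances if u.corpus == corpus]
    if mode == "SD":
        # Speaker dependent: train and test on the same speaker.
        pool = [u for u in in_corpus if u.speaker == speaker]
        return _holdout(pool, holdout_every)
    if mode == "SI":
        # Speaker independent: test speaker never appears in training.
        train = [u for u in in_corpus if u.speaker != speaker]
        test = [u for u in in_corpus if u.speaker == speaker]
        return train, test
    if mode == "CD":
        # Corpus dependent: train and test within one corpus, speakers pooled.
        return _holdout(in_corpus, holdout_every)
    if mode == "CC":
        # Cross corpus: train on the other corpora, test on the target corpus.
        train = [u for u in utterances if u.corpus != corpus]
        return train, in_corpus
    raise ValueError(f"unknown mode: {mode!r}")


# Toy data: corpus A has speakers s1 and s2, corpus B has speaker s3.
toy = [
    Utterance(c, s, i)
    for c, s in [("A", "s1"), ("A", "s2"), ("B", "s3")]
    for i in range(10)
]

for mode in ("SD", "SI", "CD", "CC"):
    train, test = split_for_mode(toy, mode, corpus="A", speaker="s1")
    print(mode, len(train), len(test))
```

Keeping the split logic in one function makes it easier to confirm that the SI and CC modes share no speaker or corpus, respectively, between training and test data.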