Lightweight and perceptually-guided voice conversion for electro-laryngeal speech

📅 2026-01-07
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Electro-laryngeal (EL) speech suffers from poor naturalness and intelligibility due to its constant pitch, limited prosody, and mechanical noise. This work adapts the lightweight voice conversion framework StreamVC to EL speech rehabilitation: it removes the pitch and energy modeling modules and combines self-supervised pretraining with supervised fine-tuning on parallel EL and healthy (HE) speech data. Training is guided by perceptual and intelligibility losses, with loss configurations that draw on WavLM features and human-feedback predictions. The best-performing model (+WavLM+HF) drastically reduces the character error rate (CER) of EL inputs, raises the naturalness mean opinion score (nMOS) from 1.1 to 3.3, and narrows the gap to healthy ground-truth speech on all evaluated metrics.

📝 Abstract
Electro-laryngeal (EL) speech is characterized by constant pitch, limited prosody, and mechanical noise, reducing naturalness and intelligibility. We propose a lightweight adaptation of the state-of-the-art StreamVC framework to this setting by removing pitch and energy modules and combining self-supervised pretraining with supervised fine-tuning on parallel EL and healthy (HE) speech data, guided by perceptual and intelligibility losses. Objective and subjective evaluations across different loss configurations confirm their influence: the best model variant, based on WavLM features and human-feedback predictions (+WavLM+HF), drastically reduces character error rate (CER) of EL inputs, raises naturalness mean opinion score (nMOS) from 1.1 to 3.3, and consistently narrows the gap to HE ground-truth speech in all evaluated metrics. These findings demonstrate the feasibility of adapting lightweight voice conversion architectures to EL voice rehabilitation while also identifying prosody generation and intelligibility improvements as the main remaining bottlenecks.
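The abstract reports intelligibility as character error rate (CER). As a minimal sketch of the standard definition (character-level edit distance divided by reference length), the following may help when reading the numbers. The function names `edit_distance` and `cer` are illustrative, and the speech recognizer and text normalization the authors used to obtain transcripts are not stated here, so this only shows the final scoring step.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference characters into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at zero cost
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of a hypothesis transcript against a reference."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

Because insertions count as errors, CER can exceed 1.0 when the hypothesis is much longer than the reference.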
Problem

Research questions and friction points this paper is trying to address.

electro-laryngeal speech
voice conversion
naturalness
intelligibility
prosody
Innovation

Methods, ideas, or system contributions that make the work stand out.

lightweight voice conversion
perceptual-guided training
electro-laryngeal speech
WavLM features
human-feedback prediction
Benedikt Mayrhofer
Signal Processing and Speech Communication Laboratory, Graz University of Technology
Franz Pernkopf
Graz University of Technology
Machine learning, artificial intelligence, pattern recognition, discriminative learning, speech and vision applications
Philipp Aichinger
Department of Otorhinolaryngology, Div. Phoniatrics-Logopedics, Medical University of Vienna
Martin Hagmüller
Graz University of Technology
Speech & Audio Processing