A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges in Russian Speech Generative Models

📅 2025-07-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Russian text-to-speech (TTS) synthesis faces core challenges including vowel reduction, consonant devoicing, variable lexical stress, homographic ambiguity, and unnatural prosody. To address these, we propose a data-centric approach: we construct Balalaika—the first large-scale, high-fidelity Russian speech dataset (>2000 hours)—featuring systematic, expert-driven annotation of punctuation, word-level stress, and prosodic boundaries, coupled with high-precision acoustic-text alignment. This enables phoneme- and word-level speech–text fidelity critical for robust TTS modeling. Models trained on Balalaika—including end-to-end TTS and speech enhancement systems—achieve statistically significant improvements over state-of-the-art baselines in naturalness (MOS), intelligibility (WER), and prosodic accuracy (stress and boundary F1). Our work establishes a reproducible, data-first paradigm for low-resource language TTS, providing both an empirical foundation and scalable methodology for high-quality speech synthesis.

Technology Category

Application Category

📝 Abstract
Russian speech synthesis presents distinctive challenges, including vowel reduction, consonant devoicing, variable stress patterns, homograph ambiguity, and unnatural intonation. This paper introduces Balalaika, a novel dataset comprising more than 2,000 hours of studio-quality Russian speech with comprehensive textual annotations, including punctuation and stress markings. Experimental results show that models trained on Balalaika significantly outperform those trained on existing datasets in both speech synthesis and enhancement tasks. We detail the dataset construction pipeline, annotation methodology, and results of comparative evaluations.
Problem

Research questions and friction points this paper is trying to address.

Addressing phonetic challenges in Russian speech synthesis
Improving prosodic accuracy in generative speech models
Resolving homograph ambiguity and unnatural intonation issues
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces Balalaika dataset for Russian speech synthesis
Includes comprehensive textual annotations and stress markings
Outperforms existing datasets in synthesis and enhancement
🔎 Similar Papers
No similar papers found.
Kirill Borodin
Kirill Borodin
MTUCI
deep learning for audiogen AIsafe AI
N
Nikita Vasiliev
Moscow Technical University of Communication and Informatics
Vasiliy Kudryavtsev
Vasiliy Kudryavtsev
MTUCI
machine learning
M
Maxim Maslov
Moscow Technical University of Communication and Informatics
M
Mikhail Gorodnichev
Moscow Technical University of Communication and Informatics
O
Oleg Rogov
Artificial Intelligence Research Institute
Grach Mkrtchian
Grach Mkrtchian
MTUCI
Artificial IntelligenceAlgorithmsData Structures