LEMAS: A 150K-Hour Large-scale Extensible Multilingual Audio Suite with Generative Speech Models

📅 2026-01-04
🏛️ arXiv.org
📈 Citations: 1
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limitations of current generative speech models in zero-shot multilingual synthesis and editing, which stem from the scarcity of large-scale, high-quality multilingual speech data with word-level timestamps. To overcome this, the authors introduce LEMAS-Dataset, an open-source corpus spanning 10 languages and 150,000 hours of speech, uniquely annotated with word-level alignment timestamps. Leveraging this dataset, they propose LEMAS-TTS, a non-autoregressive model for zero-shot multilingual text-to-speech synthesis, and LEMAS-Edit, an autoregressive model that formulates speech editing as a masked token infilling task. Through accent adversarial training, CTC loss, and adaptive decoding strategies, the models achieve substantial improvements in cross-lingual accent robustness and naturalness at edit boundaries, demonstrating the efficacy of both the dataset and the proposed methodologies.
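The summary describes LEMAS-Edit as framing speech editing as masked token infilling, with word-level timestamps used to construct training masks. A minimal sketch of how such a mask might be built is shown below; the token layout, `MASK_ID`, and `build_infilling_example` are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch: building a masked-infilling training example from
# word-level timestamps. MASK_ID and the fixed-frame-rate token layout are
# assumptions for illustration, not the paper's actual formulation.

MASK_ID = -1  # placeholder mask token id (assumption)

def build_infilling_example(tokens, word_times, frame_rate, edit_word):
    """Mask the frames of one word; the model learns to infill them.

    tokens     : list of speech token ids, one per frame
    word_times : list of (start_sec, end_sec) per word
    frame_rate : token frames per second
    edit_word  : index of the word to mask out
    """
    start_sec, end_sec = word_times[edit_word]
    s = int(round(start_sec * frame_rate))
    e = int(round(end_sec * frame_rate))
    target = tokens[s:e]                                  # tokens to predict
    masked = tokens[:s] + [MASK_ID] * (e - s) + tokens[e:]  # model input
    return masked, target, (s, e)

# Toy example: 10 frames at 10 frames/sec; word 1 spans 0.4s-0.7s.
tokens = list(range(10))
masked, target, span = build_infilling_example(
    tokens, [(0.0, 0.4), (0.4, 0.7), (0.7, 1.0)], 10, 1)
```

The precise word boundaries are what make the mask span exact, which is presumably why the dataset's timestamp annotations matter for clean edit boundaries.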

๐Ÿ“ Abstract
We present the LEMAS-Dataset, which, to our knowledge, is currently the largest open-source multilingual speech corpus with word-level timestamps. Covering over 150,000 hours across 10 major languages, LEMAS-Dataset is constructed via an efficient data processing pipeline that ensures high-quality data and annotations. To validate the effectiveness of LEMAS-Dataset across diverse generative paradigms, we train two benchmark models with distinct architectures and task specializations on this dataset. LEMAS-TTS, built upon a non-autoregressive flow-matching framework, leverages the dataset's massive scale and linguistic diversity to achieve robust zero-shot multilingual synthesis. Our proposed accent-adversarial training and CTC loss mitigate cross-lingual accent issues, enhancing synthesis stability. Complementarily, LEMAS-Edit employs an autoregressive decoder-only architecture that formulates speech editing as a masked token infilling task. By exploiting precise word-level alignments to construct training masks and adopting adaptive decoding strategies, it achieves seamless speech editing with smooth boundaries and natural transitions. Experimental results demonstrate that models trained on LEMAS-Dataset deliver high-quality synthesis and editing performance, confirming the dataset's quality. We envision that this richly timestamp-annotated, fine-grained multilingual corpus will drive future advances in prompt-based speech generation systems.
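Accent-adversarial training typically relies on a gradient-reversal layer: an accent classifier is trained on the shared encoder's features, but its gradient is negated before flowing back into the encoder, pushing the encoder toward accent-invariant representations. The sketch below illustrates only that sign flip with manual toy gradients; `lam` and the function names are assumptions, not the paper's implementation.

```python
# Illustrative sketch of the gradient-reversal idea behind accent-adversarial
# training. In a real framework this is a custom autograd op; here the
# forward/backward pair and the scaling factor `lam` are toy assumptions.

def grad_reverse_forward(x):
    # Forward pass is the identity: features pass through unchanged.
    return x

def grad_reverse_backward(grad, lam=1.0):
    # Backward pass negates (and scales) the accent classifier's gradient,
    # so the shared encoder is trained to *confuse* the accent classifier.
    return [-lam * g for g in grad]

# Toy check: the encoder receives the negated classifier gradient.
classifier_grad = [0.5, -0.2, 1.0]
encoder_grad = grad_reverse_backward(classifier_grad, lam=0.5)
```

The auxiliary CTC loss mentioned in the abstract would be added on top of this, tying the synthesized token sequence back to the text to preserve intelligibility while the adversarial term removes accent cues.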
Problem

Research questions and friction points this paper is trying to address.

multilingual speech corpus
word-level timestamps
generative speech models
speech synthesis
speech editing
Innovation

Methods, ideas, or system contributions that make the work stand out.

multilingual speech corpus
zero-shot TTS
accent-adversarial training
masked token infilling
word-level timestamps