Sylber 2.0: A Universal Syllable Embedding

📅 2026-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes a self-supervised syllable-level speech encoding framework that addresses the limitations of existing syllable modeling approaches, which are predominantly constrained to English and struggle to simultaneously achieve high-fidelity speech reconstruction and cross-lingual generalization under low temporal resolution. Operating at an ultra-low frame rate of approximately 5 Hz, the method enables efficient temporal compression while preserving high-quality reconstruction. It introduces, for the first time, a cross-lingually universal syllable embedding that effectively integrates linguistic content with fine-grained acoustic details. When integrated into a compact text-to-speech (TTS) model with only 72 million parameters, the approach achieves state-of-the-art synthesis quality. Furthermore, the learned representations serve as highly effective features for automatic speech recognition (ASR) in low-resource settings.

📝 Abstract
Scaling spoken language modeling requires speech tokens that are both efficient and universal. Recent work has proposed syllables as promising speech tokens at low temporal resolution, but existing models are constrained to English and fail to capture sufficient acoustic detail. To address this gap, we present Sylber 2.0, a self-supervised framework for coding speech at the syllable level that enables efficient temporal compression and high-fidelity reconstruction. Sylber 2.0 achieves a very low token rate of around 5 Hz while retaining both linguistic and acoustic detail across multiple languages and expressive styles. Experiments show that it performs on par with previous high-frequency baselines. Furthermore, Sylber 2.0 enables efficient TTS modeling that generates speech with intelligibility and quality competitive with SOTA models using only 72M parameters. Moreover, the universality of Sylber 2.0 yields more effective features for low-resource ASR than previous speech coding frameworks. In sum, we establish an effective syllable-level abstraction for general spoken language.
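The abstract's central efficiency claim is the ~5 Hz token rate. A minimal sketch of what that compression means for sequence length, assuming typical frame-level baseline rates of 25-50 Hz (the baseline rates are illustrative assumptions, not figures from the paper):

```python
# Compare token sequence lengths at different token rates.
# Only the ~5 Hz rate comes from the paper; the 25/50 Hz
# baselines are common frame-level rates, used here as assumptions.

def tokens_for(duration_s: float, rate_hz: float) -> int:
    """Number of tokens needed to cover `duration_s` seconds at `rate_hz`."""
    return round(duration_s * rate_hz)

duration = 10.0  # a 10-second utterance
for rate in (50.0, 25.0, 5.0):
    print(f"{rate:>4.0f} Hz -> {tokens_for(duration, rate):>3d} tokens")
# At 5 Hz the same utterance needs 10x fewer tokens than a 50 Hz tokenizer,
# which is what makes compact downstream models (e.g. a 72M-parameter TTS) feasible.
```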
Problem

Research questions and friction points this paper is trying to address.

syllable embedding
universal speech representation
low-resource ASR
spoken language modeling
acoustic detail
Innovation

Methods, ideas, or system contributions that make the work stand out.

syllable embedding
self-supervised speech modeling
low token rate
universal speech representation
efficient TTS
Cheol Jun Cho
UC Berkeley, EECS
AI · Machine Learning · Speech Processing · Neuroscience · Brain-Computer Interfaces
Nicholas Lee
Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA, USA
Alan W. Black
Carnegie Mellon University, PA, USA
G. Anumanchipalli
Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA, USA