🤖 AI Summary
Modern Hebrew text-to-speech (TTS) is difficult because of the language's complex orthography, sparse vowel diacritization, and unmarked stress placement, which leave existing grapheme-to-phoneme (G2P) approaches unable to produce accurate, low-latency International Phonetic Alphabet (IPA) transcriptions. To address this, we propose Phonikud, the first lightweight two-stage G2P framework: Stage I leverages a pre-trained Hebrew diacritization model; Stage II adds a compact neural adapter, followed by rule-based post-processing and IPA mapping. We also publicly release ILSpeech, the first open-source Hebrew speech corpus annotated with IPA transcriptions. Our method outputs fully specified IPA transcriptions with negligible additional latency and predicts phonemes more accurately than prior G2P systems. It also enables training of real-time Hebrew TTS models with a superior speed-accuracy trade-off. All code, models, and data are open-sourced.
📝 Abstract
Real-time text-to-speech (TTS) for Modern Hebrew is challenging due to the language's orthographic complexity. Existing solutions ignore crucial phonetic features, such as stress, that remain underspecified even when vowel marks are added. To address these limitations, we introduce Phonikud, a lightweight, open-source Hebrew grapheme-to-phoneme (G2P) system that outputs fully specified IPA transcriptions. Our approach adapts an existing diacritization model with lightweight adapters, incurring negligible additional latency. We also contribute the ILSpeech dataset of transcribed Hebrew speech with IPA annotations, which serves both as a benchmark for Hebrew G2P and as training data for TTS systems. Our results demonstrate that Phonikud predicts phonemes from Hebrew text more accurately than prior methods, and that this enables training of effective real-time Hebrew TTS models with superior speed-accuracy trade-offs. We release our code, data, and models at https://phonikud.github.io.
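To make the notion of G2P accuracy concrete, the following is a minimal sketch of one common way to score predicted IPA against reference IPA: a corpus-level phoneme error rate based on Levenshtein edit distance. The abstract does not state which metric the paper uses, so this metric is an assumption. The function names are hypothetical, the example strings are illustrative rather than taken from ILSpeech, and treating each Unicode code point as one symbol is a simplification.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: insertions, deletions, substitutions cost 1 each."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference symbols
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def phoneme_error_rate(references, hypotheses):
    """Total edit distance divided by total reference length over a corpus."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_len = sum(len(r) for r in references)
    if total_len == 0:
        raise ValueError("references must contain at least one symbol")
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    return total_edits / total_len


# Illustrative pair (assumed, not from the dataset): the hypothesis drops
# the stress mark, which counts as a single deletion.
reference = "ʃaˈlom"
hypothesis = "ʃalom"
rate = phoneme_error_rate([reference], [hypothesis])
```

Under this scoring, a transcription that gets every vowel and consonant right but omits stress is still penalized, which is the kind of underspecification the abstract highlights.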