Grapheme-Coherent Phonemic and Prosodic Annotation of Speech by Implicit and Explicit Grapheme Conditioning

📅 2025-06-05
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This paper addresses inconsistency between graphemes and the phonemic and prosodic labels predicted from speech. Rather than simply fine-tuning a pre-trained ASR model on the labels, the proposed model conditions label generation on the corresponding graphemes in two ways: implicitly, through a prompt encoder that uses pre-trained BERT features, and explicitly, by pruning label hypotheses that are inconsistent with the graphemes during inference. The result is parallel data of speech, labels, and graphemes that can be used for downstream tasks such as text-to-speech (TTS) and accent estimation from text. Experiments show that the method significantly improves the consistency between graphemes and predicted labels, and that the resulting parallel data improves accent estimation accuracy.

📝 Abstract
We propose a model to obtain phonemic and prosodic labels of speech that are coherent with graphemes. Unlike previous methods that simply fine-tune a pre-trained ASR model with the labels, the proposed model conditions label generation on the corresponding graphemes in two ways: 1) implicit grapheme conditioning through a prompt encoder that uses pre-trained BERT features, and 2) explicit pruning, during inference, of label hypotheses that are inconsistent with the graphemes. These methods make it possible to obtain parallel data of speech, labels, and graphemes, which is applicable to various downstream tasks such as text-to-speech and accent estimation from text. Experiments showed that the proposed method significantly improved the consistency between graphemes and the predicted labels. Further experiments on the accent estimation task confirmed that the parallel data created by the proposed method effectively improves estimation accuracy.
Problem

Research questions and friction points this paper is trying to address.

Generating grapheme-coherent phonemic and prosodic speech labels
Enhancing label-grapheme consistency via implicit and explicit conditioning
Creating parallel speech-label-grapheme data for downstream tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Implicit grapheme conditioning via BERT
Explicit pruning of inconsistent labels
Generation of parallel speech-label-grapheme data
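
To make the explicit pruning idea concrete, here is a minimal Python sketch. It is not the paper's implementation: it assumes a hypothetical toy lexicon (`LEXICON`) of allowed readings per grapheme token, hypothetical prosodic symbols (`PROSODY_MARKS`), and a post-hoc filter over a scored n-best list, whereas the paper describes pruning hypotheses during inference and does not specify these details here.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical toy lexicon: each grapheme token maps to the phoneme
# strings it may legitimately be read as (an assumption for illustration).
LEXICON: Dict[str, Set[str]] = {
    "read": {"r iy d", "r eh d"},
    "books": {"b uh k s"},
}

# Hypothetical prosodic symbols interleaved with phonemes in a label sequence.
PROSODY_MARKS: Set[str] = {"'", "/", "#"}

Hypothesis = Tuple[List[str], float]  # (label sequence, model score)


def strip_prosody(labels: List[str]) -> List[str]:
    """Drop prosodic symbols so only the phonemes remain."""
    return [s for s in labels if s not in PROSODY_MARKS]


def allowed_phoneme_sequences(graphemes: List[str]) -> Set[Tuple[str, ...]]:
    """Enumerate every phoneme sequence the lexicon permits for the text."""
    seqs: Set[Tuple[str, ...]] = {()}
    for g in graphemes:
        seqs = {
            prefix + tuple(reading.split())
            for prefix in seqs
            for reading in LEXICON[g]
        }
    return seqs


def prune(hypotheses: List[Hypothesis], graphemes: List[str]) -> List[Hypothesis]:
    """Keep only hypotheses whose phonemes match an allowed reading."""
    allowed = allowed_phoneme_sequences(graphemes)
    return [
        (labels, score)
        for labels, score in hypotheses
        if tuple(strip_prosody(labels)) in allowed
    ]


def best_consistent(
    hypotheses: List[Hypothesis], graphemes: List[str]
) -> Optional[List[str]]:
    """Return the highest-scoring grapheme-consistent labels, or None."""
    kept = prune(hypotheses, graphemes)
    if not kept:
        return None
    return max(kept, key=lambda h: h[1])[0]


if __name__ == "__main__":
    hyps: List[Hypothesis] = [
        (["r", "iy", "t", "'", "b", "uh", "k", "s"], -1.0),  # "r iy t" is not a reading of "read"
        (["r", "eh", "d", "'", "b", "uh", "k", "s"], -1.5),
        (["r", "iy", "d", "b", "uh", "k", "s"], -2.0),
    ]
    print(best_consistent(hyps, ["read", "books"]))
```

In this toy setup the top-scoring hypothesis is discarded because its phonemes match no reading of the text, so the second hypothesis is selected. Enumerating every allowed sequence grows exponentially with text length, so a practical system would more likely check consistency incrementally during decoding.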
Hien Ohnaka
Nara Institute of Science and Technology, Japan
Yuma Shirahata
LY Corporation
Byeongseon Park
LY Corporation, Japan
Ryuichi Yamamoto
LY Corporation
Speech Synthesis, Voice Conversion, Speech Recognition, Machine Learning, Singing Voice Synthesis