TokenChain: A Discrete Speech Chain via Semantic Token Modeling

📅 2025-10-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the non-differentiable text bottleneck in joint ASR–TTS training within the machine speech chain by proposing the first fully discrete speech chain framework, using semantic tokens as the unified intermediate representation. Methodologically, it (1) couples semantic-token ASR with a two-stage discrete TTS: an autoregressive text-to-semantic model co-trained with ASR and a masked-generative semantic-to-acoustic model used only for synthesis; (2) enables end-to-end feedback across the text interface via straight-through argmax and Gumbel-Softmax with dynamic temperature scheduling for stable training; and (3) balances chain feedback against supervised ASR with dynamic weight averaging to improve cross-domain transferability. Experiments show convergence 2–6 epochs faster on LibriSpeech, with 5–13% relative WER reduction at equal training epochs; on TED-LIUM, ASR WER improves by 56% relative and T2S WER by 31%, while catastrophic forgetting is significantly mitigated.
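The straight-through argmax/Gumbel-Softmax step named above can be sketched as follows. This is a minimal NumPy illustration of the standard technique, not the paper's implementation; the exponential temperature schedule and its constants (`tau0`, `tau_min`, `anneal_rate`) are assumptions for illustration.

```python
import numpy as np

def temperature(step, tau0=1.0, tau_min=0.5, anneal_rate=1e-3):
    """Hypothetical exponentially annealed Gumbel-Softmax temperature."""
    return max(tau_min, tau0 * float(np.exp(-anneal_rate * step)))

def st_gumbel_softmax(logits, tau, rng):
    """Sample a one-hot token with the straight-through Gumbel-Softmax trick.

    The forward pass uses the hard argmax one-hot; in an autodiff framework
    the backward pass flows through the soft sample
    (hard - stop_gradient(soft) + soft). Both are returned here to make the
    trick explicit.
    """
    # Gumbel(0, 1) noise via inverse transform sampling
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape) + 1e-20) + 1e-20)
    scaled = (logits + gumbel) / tau
    scaled -= scaled.max(axis=-1, keepdims=True)  # numerical stability
    soft = np.exp(scaled)
    soft /= soft.sum(axis=-1, keepdims=True)      # relaxed (soft) sample
    hard = np.eye(logits.shape[-1])[np.argmax(soft, axis=-1)]  # discrete token
    return hard, soft
```

As the temperature anneals toward `tau_min`, the soft sample sharpens toward the one-hot, shrinking the gap between the forward (hard) and backward (soft) paths.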

📝 Abstract
Machine Speech Chain, simulating the human perception-production loop, proves effective in jointly improving ASR and TTS. We propose TokenChain, a fully discrete speech chain coupling semantic-token ASR with a two-stage TTS: an autoregressive text-to-semantic model co-trained with ASR and a masked-generative semantic-to-acoustic model for synthesis only. End-to-end feedback across the text interface is enabled with straight-through argmax/Gumbel-Softmax and balanced with supervised ASR via dynamic weight averaging. Ablations examine optimal temperature schedules for in- and cross-domain transfer. Evaluation reveals TokenChain surpasses baseline accuracy 2-6 epochs earlier and yields 5-13% lower equal-epoch error with stable T2S on LibriSpeech, and reduces relative ASR WER by 56% and T2S WER by 31% on TED-LIUM with minimal forgetting, showing that chain learning remains effective with token interfaces and models.
Problem

Research questions and friction points this paper is trying to address.

How to enable end-to-end feedback in the speech chain when the discrete text interface blocks gradient flow between ASR and TTS
How to balance chain feedback against the supervised ASR objective during joint training
Whether discrete chain learning transfers across domains (LibriSpeech to TED-LIUM) without catastrophic forgetting
Innovation

Methods, ideas, or system contributions that make the work stand out.

Discrete speech chain with semantic token modeling
Two-stage TTS combining autoregressive and masked-generative models
End-to-end feedback via straight-through gradient estimation
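The balancing of chain feedback against supervised ASR can be sketched with the standard dynamic weight averaging formulation, where tasks whose loss is descending more slowly receive larger weight. The temperature value and two-task setup below are assumptions, not the paper's exact configuration.

```python
import math

def dwa_weights(losses_prev, losses_prev2, temperature=2.0):
    """Dynamic weight averaging over K task losses.

    Each task's weight is driven by its loss ratio L(t-1) / L(t-2): a ratio
    near 1 (stagnant loss) yields a larger weight than a ratio below 1
    (improving loss). Weights are normalized to sum to K.
    """
    ratios = [a / b for a, b in zip(losses_prev, losses_prev2)]
    exps = [math.exp(r / temperature) for r in ratios]
    K, total = len(exps), sum(exps)
    return [K * e / total for e in exps]
```

For example, with a supervised ASR loss that improved (0.9 from 1.0) and a feedback loss that stagnated (1.0 from 1.0), the stagnant task gets the larger weight on the next epoch.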
Mingxuan Wang
School of Data Science, The Chinese University of Hong Kong, Shenzhen
Satoshi Nakamura
The Chinese University of Hong Kong, Shenzhen
speech and natural language processing