Improving Direct Persian-English Speech-to-Speech Translation with Discrete Units and Synthetic Parallel Data

📅 2025-11-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the scarcity of parallel speech data for low-resource languages such as Persian, which severely limits direct speech-to-speech translation (S2ST) performance, this paper proposes a direct S2ST system that combines discrete speech unit modeling with synthetic data augmentation. The authors design a parallel corpus generation pipeline that translates Persian transcriptions with a large language model and synthesizes the corresponding English speech with a zero-shot text-to-speech system, increasing the available Persian-English parallel speech roughly sixfold. The architecture comprises a Conformer encoder initialized from self-supervised pre-training, a causal Transformer decoder with relative position multi-head attention that predicts discrete target speech units, and a unit-based neural vocoder. On the Persian-English portion of the CVSS corpus, the model improves ASR BLEU by 4.6 over direct baselines when trained with the synthetic data.

📝 Abstract

Direct speech-to-speech translation (S2ST), in which all components are trained jointly, is an attractive alternative to cascaded systems because it offers a simpler pipeline and lower inference latency. However, direct S2ST models require large amounts of parallel speech data in the source and target languages, which are rarely available for low-resource languages such as Persian. This paper presents a direct S2ST system for translating Persian speech into English speech, as well as a pipeline for synthetic parallel Persian-English speech generation. The model comprises three components: (1) a Conformer-based encoder, initialized from self-supervised pre-training, maps source speech to high-level acoustic representations; (2) a causal Transformer decoder with relative position multi-head attention translates these representations into discrete target speech units; (3) a unit-based neural vocoder generates waveforms from the predicted discrete units. To mitigate the data scarcity problem, we construct a new Persian-English parallel speech corpus by translating Persian speech transcriptions into English using a large language model and then synthesizing the corresponding English speech with a state-of-the-art zero-shot text-to-speech system. The resulting corpus increases the amount of available parallel speech by roughly a factor of six. On the Persian-English portion of the CVSS corpus, the proposed model achieves an improvement of 4.6 ASR BLEU over direct baselines when trained with the synthetic data. These results indicate that combining self-supervised pre-training, discrete speech units, and synthetic parallel data is effective for improving direct S2ST in low-resource language pairs such as Persian-English.
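The three-component data flow described in the abstract can be pictured with a minimal Python sketch. This is not the authors' implementation: the function name `translate_speech`, the callables `encode`, `decode_step`, and `vocode`, the `eos_unit` marker, and the use of greedy decoding are all assumptions made for illustration. The sketch only shows the order in which an encoder, an autoregressive unit decoder, and a unit-based vocoder would be chained at inference time.

```python
from typing import Callable, List, Sequence

Waveform = List[float]        # raw audio samples
Features = List[List[float]]  # one feature vector per encoder frame
Units = List[int]             # discrete target speech units


def translate_speech(
    source_wave: Waveform,
    encode: Callable[[Waveform], Features],
    decode_step: Callable[[Features, Sequence[int]], int],
    vocode: Callable[[Units], Waveform],
    eos_unit: int,
    max_units: int = 1000,
) -> Waveform:
    """Chain encoder -> causal unit decoder -> vocoder (hypothetical interfaces)."""
    # (1) Encoder: source speech to high-level acoustic representations.
    memory = encode(source_wave)

    # (2) Causal decoder: predict one discrete unit at a time, conditioning
    #     only on the encoder output and the units emitted so far.
    #     Greedy decoding is an assumption; the source does not specify it.
    units: Units = []
    while len(units) < max_units:
        next_unit = decode_step(memory, units)
        if next_unit == eos_unit:
            break
        units.append(next_unit)

    # (3) Unit-based vocoder: discrete units to a target waveform.
    return vocode(units)
```

Because the decoder emits speech units rather than text, no intermediate English transcript is produced in this flow, which is what distinguishes a direct system from a cascaded one.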

Problem

Research questions and friction points this paper is trying to address.

Developing direct Persian-English speech translation with limited parallel data

Generating synthetic parallel speech data using translation and synthesis

Improving translation quality through discrete units and self-supervised learning
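The synthetic data step named above (translate the Persian transcription, then synthesize English speech) might be organized as in the following sketch. The record types `SourceUtterance` and `ParallelPair`, the function `build_synthetic_pairs`, and the callables `translate` and `synthesize` are hypothetical placeholders for a large language model and a zero-shot text-to-speech system; the source does not name the specific models or any filtering rules, so skipping empty transcripts is an assumption, and speaker prompting for the TTS system is left out.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class SourceUtterance:
    audio_path: str   # Persian source speech
    transcript: str   # Persian transcription of that speech


@dataclass
class ParallelPair:
    source_audio_path: str  # original Persian speech
    target_text: str        # machine-translated English text
    target_audio_path: str  # synthesized English speech


def build_synthetic_pairs(
    utterances: Iterable[SourceUtterance],
    translate: Callable[[str], str],   # hypothetical LLM wrapper: Persian -> English text
    synthesize: Callable[[str], str],  # hypothetical TTS wrapper: English text -> audio path
) -> List[ParallelPair]:
    """Pair each Persian utterance with synthesized English speech."""
    pairs: List[ParallelPair] = []
    for utt in utterances:
        # Assumption: utterances without a usable transcript are skipped.
        if not utt.transcript.strip():
            continue
        english_text = translate(utt.transcript)
        english_audio = synthesize(english_text)
        pairs.append(ParallelPair(utt.audio_path, english_text, english_audio))
    return pairs
```

Keeping the translation and synthesis steps behind plain callables makes it possible to swap either model without touching the pairing logic.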

Innovation

Methods, ideas, or system contributions that make the work stand out.

Using discrete speech units for translation

Generating synthetic parallel speech data

Self-supervised pre-training for acoustic representations
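To make "discrete speech units" concrete: the summary does not say how the units are obtained, but a common approach in unit-based S2ST is to cluster frame-level self-supervised features with k-means and replace each frame by the index of its nearest centroid. The sketch below assumes that approach. The function names, the toy centroids, and the optional merging of consecutive repeated units are illustrative assumptions, not details taken from the paper.

```python
from typing import List, Sequence


def nearest_centroid(frame: Sequence[float], centroids: Sequence[Sequence[float]]) -> int:
    """Return the index of the centroid closest to `frame` (squared Euclidean distance)."""
    best_index, best_dist = 0, float("inf")
    for index, centroid in enumerate(centroids):
        dist = sum((f - c) ** 2 for f, c in zip(frame, centroid))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def frames_to_units(
    frames: Sequence[Sequence[float]],
    centroids: Sequence[Sequence[float]],
    dedupe: bool = True,
) -> List[int]:
    """Quantize feature frames to unit IDs, optionally merging consecutive repeats."""
    units = [nearest_centroid(frame, centroids) for frame in frames]
    if not dedupe:
        return units
    reduced: List[int] = []
    for unit in units:
        # Keep a unit only when it differs from the previously kept one.
        if not reduced or reduced[-1] != unit:
            reduced.append(unit)
    return reduced
```

For example, with two toy centroids `[[0.0, 0.0], [1.0, 1.0]]`, four frames that fall near centroid 0, 0, 1, and 0 in turn become the unit sequence `[0, 1, 0]` after merging repeats. The decoder then only has to predict a short sequence of integers rather than continuous spectrogram frames.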
