Speech-to-Text Translation with Phoneme-Augmented CoT: Enhancing Cross-Lingual Transfer in Low-Resource Scenarios

📅 2025-05-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Low-resource and zero-resource speech-to-text translation (S2TT) suffers from a severe scarcity of labeled speech data. Method: We propose a phoneme-augmented chain-of-thought (CoT) framework that uses phoneme recognition as an interpretable intermediate step to improve cross-lingual transfer. The system extends a multilingual large language model (LLM) to process speech and phonemes, and is trained with a curriculum learning strategy that progressively introduces more complex tasks. Contribution/Results: On multilingual S2TT benchmarks, the method improves translation quality in low-resource conditions and enables translation for zero-resource languages, that is, languages with no labeled speech data. This comes at the cost of slightly lower performance in high-resource conditions.

📝 Abstract
We propose a Speech-to-Text Translation (S2TT) approach that integrates phoneme representations into a Chain-of-Thought (CoT) framework to improve translation in low-resource and zero-resource settings. By introducing phoneme recognition as an intermediate step, we enhance cross-lingual transfer, enabling translation even for languages with no labeled speech data. Our system builds on a multilingual LLM, which we extend to process speech and phonemes. Training follows a curriculum learning strategy that progressively introduces more complex tasks. Experiments on multilingual S2TT benchmarks show that phoneme-augmented CoT improves translation quality in low-resource conditions and enables zero-resource translation, while slightly impacting high-resource performance. Despite this trade-off, our findings demonstrate that phoneme-based CoT is a promising step toward making S2TT more accessible across diverse languages.
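
To make the idea of phoneme recognition as an intermediate step concrete, the following is a minimal sketch of how a CoT target sequence could be laid out and parsed. It is not the paper's actual format: the tag names, the `CoTExample` class, the inclusion of a transcript step, and the Catalan example strings are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical step names and order (phonemes first, translation last).
# The transcript step is an assumption; the abstract only names phonemes.
STEPS = ("phonemes", "transcript", "translation")


@dataclass
class CoTExample:
    phonemes: str      # phoneme sequence recognized from the speech input
    transcript: str    # source-language text
    translation: str   # target-language text


def build_cot_target(ex: CoTExample) -> str:
    """Serialize the steps in order, one tagged line per step."""
    return "\n".join(f"<{name}> {getattr(ex, name)} </{name}>" for name in STEPS)


def extract_step(output: str, name: str) -> Optional[str]:
    """Return the text between <name> and </name>, or None if it is missing."""
    open_tag, close_tag = f"<{name}>", f"</{name}>"
    start = output.find(open_tag)
    end = output.find(close_tag)
    if start == -1 or end == -1 or end < start:
        return None
    return output[start + len(open_tag):end].strip()


# Illustrative data only (not taken from the paper).
example = CoTExample(
    phonemes="b ɔ n d i ə",
    transcript="bon dia",
    translation="good morning",
)
target = build_cot_target(example)
final_translation = extract_step(target, "translation")
```

Under this layout, a model trained on such targets emits the phoneme line before the translation, so the intermediate prediction can be inspected separately from the final output.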
Problem

Research questions and friction points this paper is trying to address.

Improving speech-to-text translation in low-resource settings
Enabling zero-resource translation using phoneme-augmented CoT
Enhancing cross-lingual transfer with phoneme recognition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates phoneme representations into CoT
Uses phoneme recognition for cross-lingual transfer
Employs curriculum learning for progressive training
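
The curriculum idea, progressively introducing more complex tasks, can be sketched as a step-indexed schedule. This is a sketch under stated assumptions: the task names, the step boundaries, and the `active_tasks` helper are hypothetical and do not come from the paper.

```python
from typing import List, Tuple

# Hypothetical schedule: (first training step of the stage, tasks in the mix).
# Each stage keeps the earlier tasks and adds a harder one.
CURRICULUM: List[Tuple[int, List[str]]] = [
    (0, ["phoneme_recognition"]),
    (1000, ["phoneme_recognition", "speech_recognition"]),
    (2000, ["phoneme_recognition", "speech_recognition", "cot_translation"]),
]


def active_tasks(step: int) -> List[str]:
    """Return the tasks to sample from at the given training step."""
    tasks: List[str] = []
    for start, stage_tasks in CURRICULUM:
        if step >= start:
            tasks = stage_tasks  # later stages override earlier ones
    return tasks
```

A training loop would call `active_tasks(step)` before drawing each batch, so the full CoT translation task only enters the mix once the simpler tasks have been introduced.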
Gerard I. Gállego
Barcelona Supercomputing Center, Spain; Universitat Politècnica de Catalunya, Spain
Oriol Pareras
Research Engineer, Barcelona Supercomputing Center
Natural Language Processing, Multimodality, Deep Learning
Martí Cortada Garcia
Barcelona Supercomputing Center, Spain
Lucas Takanori
Barcelona Supercomputing Center, Spain
Javier Hernando
Professor of Electronic Engineering, Universitat Politècnica de Catalunya
Speech Processing, Biometrics