PROST-LLM: Progressively Enhancing the Speech-to-Speech Translation Capability in LLMs

📅 2026-01-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the scarcity of annotated data that limits large language models in speech-to-speech translation (S2ST). The authors propose a preference optimization framework that requires no human annotations. It begins with an initial training phase built on tri-task learning and a chain-of-modality approach, followed by the automatic generation of preference pairs through self-sampling and back-translation (sketched below). A progressive training strategy then refines the model iteratively. This pipeline alleviates the data bottleneck inherent in S2ST, yielding substantial improvements in translation quality across multiple benchmarks and demonstrating the effectiveness of pairing a self-generated preference mechanism with a multi-stage training paradigm.
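The pair-generation step lends itself to a short sketch. The following is a minimal illustration, not the authors' released code: `sample_fn`, `back_translate_fn`, and the token-overlap scorer are hypothetical stand-ins for the model's sampling interface, the back-translation pass, and whatever round-trip quality metric the paper actually uses.

```python
from typing import Callable, List, Tuple

def token_f1(a: str, b: str) -> float:
    """Crude token-overlap score standing in for a real back-translation
    quality metric (e.g. BLEU of the back-translation against the source)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    overlap = len(sa & sb)
    prec, rec = overlap / len(sb), overlap / len(sa)
    return 0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec)

def build_preference_pairs(
    sample_fn: Callable[[str, int], List[str]],   # source -> candidate translations (hypothetical)
    back_translate_fn: Callable[[str], str],      # candidate -> back-translated source (hypothetical)
    sources: List[str],
    num_samples: int = 8,
) -> List[Tuple[str, str, str]]:
    """Self-sampling + back-translation pair construction: sample several
    candidates per source, rank them by round-trip agreement with the
    source, and keep the best/worst as a (chosen, rejected) pair."""
    pairs = []
    for src in sources:
        candidates = sample_fn(src, num_samples)
        ranked = sorted(candidates,
                        key=lambda c: token_f1(src, back_translate_fn(c)),
                        reverse=True)
        pairs.append((src, ranked[0], ranked[-1]))  # (source, chosen, rejected)
    return pairs
```

Because scoring relies only on round-trip agreement with the source, the loop runs without any human evaluation, which is the point of the framework.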

📝 Abstract
Although Large Language Models (LLMs) excel in many tasks, their application to Speech-to-Speech Translation (S2ST) is underexplored and hindered by data scarcity. To bridge this gap, we propose PROST-LLM (PROgressive Speech-to-speech Translation) to enhance the S2ST capabilities in LLMs progressively. First, we fine-tune the LLMs with the CVSS corpus, employing designed tri-task learning and chain of modality methods to boost the initial performance. Then, leveraging the fine-tuned model, we generate preference pairs through self-sampling and back-translation without human evaluation. Finally, these preference pairs are used for preference optimization to enhance the model's S2ST capability further. Extensive experiments confirm the effectiveness of our proposed PROST-LLM in improving the S2ST capability of LLMs.
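The abstract does not name the exact preference-optimization objective; a DPO-style loss is one common instantiation and is sketched below under that assumption. Inputs are the summed token log-probabilities of the chosen and rejected responses under the policy being trained and under a frozen reference model (here, plausibly the fine-tuned checkpoint).

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO-style preference loss over (chosen, rejected) pairs:
    -log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))), averaged.
    Pushes the policy to prefer the chosen response relative to the
    frozen reference, without an explicit reward model."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```

The beta hyperparameter controls how far the policy may drift from the reference; the specific value and objective used by PROST-LLM may differ from this sketch.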
Problem

Research questions and friction points this paper is trying to address.

Speech-to-Speech Translation (S2ST)
Large Language Models
Data Scarcity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Speech-to-Speech Translation
Large Language Models
Preference Optimization
Tri-task Learning
Chain of Modality
Jing Xu
Hong Kong University of Science and Technology (Guangzhou)
Computer Vision · AI application · Representation learning
Jiaqi Wang
Unknown affiliation
Daxin Tan
Huawei Artificial Intelligence Laboratory (Leibniz)
Xiao Chen
Huawei Artificial Intelligence Laboratory (Leibniz)