PROST-LLM: Progressively Enhancing the Speech-to-Speech Translation Capability in LLMs

📅 2026-01-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the scarcity of annotated data that limits large language models in speech-to-speech translation (S2ST). The authors propose a preference optimization framework that requires no human annotations. It begins with an initial training phase built on tri-task learning and a chain-of-modality approach, followed by the automatic generation of preference pairs through self-sampling and back-translation (sketched below). A progressive training strategy then refines the model iteratively. This pipeline alleviates the data bottleneck inherent in S2ST, yielding substantial improvements in translation quality across multiple benchmarks and demonstrating the effectiveness of pairing a self-generated preference mechanism with a multi-stage training paradigm.
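The pair-generation step lends itself to a short sketch. The following is a minimal illustration, not the authors' released code: `sample_fn`, `back_translate_fn`, and the token-overlap scorer are hypothetical stand-ins for the model's sampling interface, the back-translation pass, and whatever round-trip quality metric the paper actually uses.

```python
from typing import Callable, List, Tuple

def token_f1(a: str, b: str) -> float:
    """Crude token-overlap score standing in for a real back-translation
    quality metric (e.g. BLEU of the back-translation against the source)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    overlap = len(sa & sb)
    prec, rec = overlap / len(sb), overlap / len(sa)
    return 0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec)

def build_preference_pairs(
    sample_fn: Callable[[str, int], List[str]],   # source -> candidate translations (hypothetical)
    back_translate_fn: Callable[[str], str],      # candidate -> back-translated source (hypothetical)
    sources: List[str],
    num_samples: int = 8,
) -> List[Tuple[str, str, str]]:
    """Self-sampling + back-translation pair construction: sample several
    candidates per source, rank them by round-trip agreement with the
    source, and keep the best/worst as a (chosen, rejected) pair."""
    pairs = []
    for src in sources:
        candidates = sample_fn(src, num_samples)
        ranked = sorted(candidates,
                        key=lambda c: token_f1(src, back_translate_fn(c)),
                        reverse=True)
        pairs.append((src, ranked[0], ranked[-1]))  # (source, chosen, rejected)
    return pairs
```

Because scoring relies only on round-trip agreement with the source, the loop runs without any human evaluation, which is the point of the framework.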

📝 Abstract
Although Large Language Models (LLMs) excel in many tasks, their application to Speech-to-Speech Translation (S2ST) is underexplored and hindered by data scarcity. To bridge this gap, we propose PROST-LLM (PROgressive Speech-to-speech Translation) to enhance the S2ST capabilities in LLMs progressively. First, we fine-tune the LLMs with the CVSS corpus, employing designed tri-task learning and chain of modality methods to boost the initial performance. Then, leveraging the fine-tuned model, we generate preference pairs through self-sampling and back-translation without human evaluation. Finally, these preference pairs are used for preference optimization to enhance the model's S2ST capability further. Extensive experiments confirm the effectiveness of our proposed PROST-LLM in improving the S2ST capability of LLMs.
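The abstract does not name the exact preference-optimization objective; a DPO-style loss is one common instantiation and is sketched below under that assumption. Inputs are the summed token log-probabilities of the chosen and rejected responses under the policy being trained and under a frozen reference model (here, plausibly the fine-tuned checkpoint).

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO-style preference loss over (chosen, rejected) pairs:
    -log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))), averaged.
    Pushes the policy to prefer the chosen response relative to the
    frozen reference, without an explicit reward model."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```

The beta hyperparameter controls how far the policy may drift from the reference; the specific value and objective used by PROST-LLM may differ from this sketch.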
Problem

Research questions and friction points this paper is trying to address.

Speech-to-Speech Translation (S2ST)
Large Language Models
Data Scarcity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Speech-to-Speech Translation
Large Language Models
Preference Optimization
Tri-task Learning
Chain of Modality
Jing Xu
Hong Kong University of Science and Technology (Guangzhou)
Computer Vision · AI application · Representation learning
Jiaqi Wang
Unknown affiliation
Daxin Tan
Huawei Artificial Intelligence Laboratory (Leibniz)
Xiao Chen
Huawei Artificial Intelligence Laboratory (Leibniz)