DPO-Tuned Large Language Models for Segmentation in Simultaneous Speech Translation

📅 2025-10-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

In simultaneous speech translation (ST), speech segmentation must balance translation quality against latency, yet existing pretrained segmentation models trained with supervised objectives (e.g., SHAS) do not capture human interpreters’ segmentation preferences. This work introduces, for the first time, human preference alignment into speech segmentation: it proposes a large language model (LLM)-based segmentation framework trained with Direct Preference Optimization (DPO) in place of a conventional supervised objective, so that predicted boundaries better reflect real-time interpreting practice. The DPO-tuned LLM is paired with the SeamlessM4T v2 translation backbone and evaluated on the ACL 60/60 multilingual corpus. Experiments report consistent improvements over SHAS across three language pairs (English–Japanese, English–Chinese, and English–German): higher segmentation accuracy, higher BLEU and COMET scores, and lower Average Lagging. The results point to preference-driven segmentation as a promising direction for simultaneous ST.

📝 Abstract

Simultaneous speech translation requires accurate segmentation to balance translation quality and latency. Pretrained segmentation models such as SHAS outperform heuristic rules and are more robust, but they remain constrained by supervised learning objectives and do not incorporate human preference alignment, which is crucial for natural real-time interpretation. In this work, we propose a segmentation framework based on large language models (LLMs) trained with Direct Preference Optimization (DPO). By leveraging preference alignment, our method enables LLMs to predict natural segmentation points that better meet the demands of real-time translation. We evaluate the system on the ACL 60/60 corpus across three language pairs (English-Japanese, English-Chinese, and English-German), using SeamlessM4T v2 as the translation backbone. Experimental results show that our DPO-tuned LLM achieves higher segmentation accuracy than SHAS and yields consistent improvements in both translation quality (BLEU, COMET) and latency (Average Lagging). Furthermore, our system benefits from IWSLT baselines for direct comparison. These findings highlight the potential of preference-tuned LLMs to surpass existing pretrained segmentation models and advance adaptive, human-aligned simultaneous interpretation.
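
The abstract names Direct Preference Optimization but does not spell out the objective. As a rough illustration, the standard DPO loss for a single preference pair can be sketched as below. The function name, the `beta` default, and the example log-probabilities are assumptions for illustration, not values taken from the paper.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a full output
    (here: one candidate segmentation) under the trainable policy or
    the frozen reference model. beta=0.1 is an assumed default.
    """
    # Implicit rewards: how much more the policy favors each output
    # than the reference model does, scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward

    # loss = -log(sigmoid(margin)) = log(1 + exp(-margin)),
    # written in a numerically stable form for either sign of margin.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities: the policy already prefers the chosen output.
print(dpo_loss(-5.0, -12.0, -8.0, -8.0))
```

The loss shrinks as the policy assigns relatively more probability to the preferred segmentation than the reference model does, which is the sense in which the objective replaces a supervised target with a pairwise preference.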

Problem

Research questions and friction points this paper is trying to address.

Optimizing segmentation for simultaneous speech translation quality

Addressing limitations of supervised segmentation models lacking human preference

Improving real-time translation latency through preference-aligned LLMs
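
Latency in this summary is reported as Average Lagging. A minimal sketch of the token-level definition of that metric follows; it assumes delays are counted in source tokens, whereas speech systems typically measure delays in elapsed audio time, so this is an illustration of the idea and not the paper's exact evaluation setup. The function and argument names are hypothetical.

```python
def average_lagging(delays, source_len):
    """Token-level Average Lagging.

    delays[t] is the number of source tokens that had been read when
    target token t+1 was emitted; source_len is the total source length.
    """
    target_len = len(delays)
    if target_len == 0 or source_len <= 0:
        raise ValueError("need a non-empty target and a positive source length")

    # gamma: target-to-source length ratio, used to place the ideal
    # (zero-lag) translator on the same scale as the real one.
    gamma = target_len / source_len

    # tau: first target step at which the whole source has been read.
    # Later steps are excluded because they add no further waiting.
    tau = next(
        (t for t, g in enumerate(delays, start=1) if g >= source_len),
        target_len,
    )

    lag_sum = sum(delays[t - 1] - (t - 1) / gamma for t in range(1, tau + 1))
    return lag_sum / tau


# A policy that stays one source token ahead of the output lags by 1 token.
print(average_lagging([1, 2, 3, 4], source_len=4))
```

Lower values mean the system commits to output sooner; a system that waits for the full source before emitting anything scores the full source length.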

Innovation

Methods, ideas, or system contributions that make the work stand out.

DPO-tuned LLMs for segmentation prediction

Preference alignment for natural segmentation points

Outperforms SHAS in segmentation accuracy, translation quality (BLEU, COMET), and latency (Average Lagging)
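
Preference tuning needs pairs of a preferred and a dispreferred output for the same input. The summary does not describe how the paper builds its pairs, so the sketch below is purely hypothetical: it assumes a text transcript, a ` | ` boundary marker, an invented prompt string, and the prompt/chosen/rejected record layout that is common in preference-tuning datasets.

```python
from dataclasses import dataclass

BOUNDARY = " | "  # hypothetical marker for a segment boundary


@dataclass
class PreferencePair:
    prompt: str    # input shown to the model
    chosen: str    # segmentation assumed to be preferred
    rejected: str  # alternative segmentation assumed to be dispreferred


def mark_boundaries(tokens, boundaries):
    """Join tokens, ending a segment after each token index in boundaries."""
    cuts = set(boundaries)
    segments, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        if i in cuts or i == len(tokens) - 1:
            segments.append(" ".join(current))
            current = []
    return BOUNDARY.join(segments)


def make_pair(tokens, preferred, dispreferred):
    """Build one preference record from two candidate boundary sets."""
    if set(preferred) == set(dispreferred):
        raise ValueError("the two segmentations must differ")
    prompt = "Segment for simultaneous translation: " + " ".join(tokens)
    return PreferencePair(
        prompt=prompt,
        chosen=mark_boundaries(tokens, preferred),
        rejected=mark_boundaries(tokens, dispreferred),
    )


tokens = "we propose a framework based on large language models".split()
pair = make_pair(tokens, preferred=[3], dispreferred=[5])
print(pair.chosen)
print(pair.rejected)
```

In this toy example the preferred cut falls at a phrase boundary and the dispreferred cut splits a prepositional phrase, which is the kind of contrast a preference signal could encode.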
