Conversational SimulMT: Efficient Simultaneous Translation with Large Language Models

📅 2024-02-16
🏛️ International Workshop on Spoken Language Translation
📈 Citations: 4
✨ Influential: 0
🤖 AI Summary
Simultaneous machine translation (SimulMT) faces a fundamental trade-off among translation quality, latency, and the computational cost of large language model (LLM) inference. To address this, we propose the first multi-turn conversational decoding framework tailored for SimulMT, integrating Llama2-7b-chat into streaming translation. Our approach introduces a dynamic waiting policy and a lightweight context compression mechanism to substantially reduce autoregressive decoding overhead. Crucially, it shifts from conventional single-pass generation to iterative, interactive decoding, enabling fine-grained incremental output while preserving semantic coherence. Evaluated on two standard SimulMT benchmarks, our method surpasses dedicated SimulMT models in BLEU score, achieves comparable average latency, and reduces latency by over 42% compared to standard LLM-based streaming translation. To the best of our knowledge, this is the first work to jointly achieve high translation quality, low latency, and computational efficiency in LLM-driven SimulMT.
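The summary describes casting streaming translation as a multi-turn chat: each new source chunk arrives as a user turn and each partial translation is returned as an assistant turn, so the conversation history carries the incremental context. The paper's exact prompting and dynamic waiting policy are not reproduced here; the sketch below is a minimal illustration that assumes a simple wait-k-style policy (emit a turn after every k source words) and uses a hypothetical stub in place of the Llama2-7b-chat call.

```python
# Illustrative sketch of multi-turn conversational simultaneous decoding.
# stub_chat_model is a hypothetical stand-in for an LLM chat call; it
# "translates" the latest user chunk by upper-casing it so the turn
# structure can be demonstrated end to end.

def stub_chat_model(messages):
    """Return a fake 'translation' of the most recent user turn."""
    return messages[-1]["content"].upper()

def conversational_simulmt(source_words, k=3, model=stub_chat_model):
    """Stream source words into the chat as successive user turns,
    collecting partial translations as assistant turns."""
    messages = [{"role": "system",
                 "content": "Translate the incoming text incrementally."}]
    outputs = []
    buffer = []
    for word in source_words:
        buffer.append(word)
        if len(buffer) == k:  # waiting policy (assumed): emit every k words
            messages.append({"role": "user", "content": " ".join(buffer)})
            reply = model(messages)  # one decoding turn per source chunk
            messages.append({"role": "assistant", "content": reply})
            outputs.append(reply)
            buffer = []
    if buffer:  # flush the final partial chunk at end of stream
        messages.append({"role": "user", "content": " ".join(buffer)})
        reply = model(messages)
        messages.append({"role": "assistant", "content": reply})
        outputs.append(reply)
    return outputs

print(conversational_simulmt("the cat sat on the mat".split(), k=3))
# → ['THE CAT SAT', 'ON THE MAT']
```

Because the history grows with every turn, a real system would pair this loop with the context compression the summary mentions (e.g., pruning or summarizing old turns) to keep per-turn decoding cost bounded.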

๐Ÿ“ Abstract
Simultaneous machine translation (SimulMT) presents a challenging trade-off between translation quality and latency. Recent studies have shown that LLMs can achieve good performance in SimulMT tasks. However, this often comes at the expense of high inference cost and latency. In this paper, we propose a conversational SimulMT framework to enhance the inference efficiency of LLM-based SimulMT through multi-turn-dialogue-based decoding. Our experiments with Llama2-7b-chat on two SimulMT benchmarks demonstrate the superiority of LLMs in translation quality while achieving computational latency comparable to that of specialized SimulMT models.
Problem

Research questions and friction points this paper is trying to address.

Balancing translation quality and latency in simultaneous machine translation
Reducing high inference cost and latency in LLM-based SimulMT
Enhancing efficiency of simultaneous translation using dialogue-based decoding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-turn dialogue decoding enhances LLM efficiency
Conversational framework reduces inference cost and latency
LLMs achieve quality comparable to specialized translation models
Minghan Wang
Department of Data Science & AI, Monash University
Thuy-Trang Vu
Monash University
Ehsan Shareghi
Monash University
Gholamreza Haffari
Department of Data Science & AI, Monash University