InfiniSST: Simultaneous Translation of Unbounded Speech with Large Language Model

📅 2025-03-04
🤖 AI Summary
This work addresses real-time simultaneous speech translation (SST) for unbounded streaming speech by proposing an end-to-end modeling framework that eliminates the need for pre-segmentation of input audio. It reformulates SST as a multi-turn dialogue task. Methodologically, it introduces a novel multi-latency-augmented data construction strategy and a dynamic key-value (KV) cache management mechanism, enabling efficient retention of historical speech and translation context while substantially reducing computational overhead and latency. Built upon a large language model architecture, the approach constructs translation trajectories from the MuST-C corpus for training. Experiments across three language pairs (En→Es, En→De, En→Zh) demonstrate a computation-aware latency reduction of 0.5–1.0 seconds over strong baselines, with translation quality (BLEU) preserved. The implementation is publicly available.
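The multi-turn dialogue reformulation described above can be pictured as interleaving incoming speech chunks (user turns) with partial translations emitted so far (assistant turns) into one continuous sequence. The sketch below is a minimal illustration under assumed conventions; the turn markers (`<speech>`, `<trans>`) and function name are hypothetical and not taken from the paper.

```python
# Hypothetical sketch: format streaming SST as a multi-turn dialogue.
# Each incoming speech chunk becomes a "user" turn, and the translation
# emitted in response becomes an "assistant" turn, so the model sees one
# unbounded interleaved sequence rather than pre-segmented utterances.

def build_dialogue_trajectory(speech_chunks, partial_translations):
    """Interleave speech chunks with the translations emitted after
    each chunk, producing one dialogue-style sequence."""
    assert len(speech_chunks) == len(partial_translations)
    turns = []
    for audio, text in zip(speech_chunks, partial_translations):
        turns.append(f"<speech>{audio}</speech>")  # new audio arrives
        turns.append(f"<trans>{text}</trans>")     # model's partial output
    return "".join(turns)

traj = build_dialogue_trajectory(["chunk0", "chunk1"], ["Hola", " mundo"])
```

In this framing, translating unbounded speech is just continuing a very long conversation, which is what makes cache management (rather than re-encoding from scratch) the key efficiency lever.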

📝 Abstract
Simultaneous translation of unbounded streaming speech remains a challenging problem due to the need for effectively processing the history speech context and past translations so that quality and latency, including computation overhead, can be balanced. Most prior works assume pre-segmented speech, limiting their real-world applicability. In this paper, we propose InfiniSST, a novel approach that formulates SST as a multi-turn dialogue task, enabling seamless translation of unbounded speech. We construct translation trajectories and robust segments from MuST-C with multi-latency augmentation during training and develop a key-value (KV) cache management strategy to facilitate efficient inference. Experiments on MuST-C En-Es, En-De, and En-Zh demonstrate that InfiniSST reduces computation-aware latency by 0.5 to 1 second while maintaining the same translation quality compared to baselines. Ablation studies further validate the contributions of our data construction and cache management strategy. We release the code at https://github.com/LeiLiLab/InfiniSST
Problem

Research questions and friction points this paper is trying to address.

Simultaneous translation of unbounded streaming speech
Balancing quality, latency, and computation overhead
Pre-segmentation assumptions that limit real-world applicability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Formulates SST as multi-turn dialogue task
Uses multi-latency augmentation for training
Implements KV cache management for efficiency
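The KV cache management idea named above can be sketched as a sliding-window eviction policy: keep a fixed prefix (e.g. the instruction prompt) plus the most recent entries, dropping the middle so the cache never grows without bound. This is a minimal illustration only; the function name, budget sizes, and exact retention policy are assumptions, not the paper's actual mechanism or hyperparameters.

```python
# Hypothetical sketch of a sliding-window KV-cache policy for unbounded
# streaming inference. `cache` is a list of (key, value) pairs, oldest
# first. We retain the first `keep_prefix` entries (e.g. the prompt)
# and the newest entries, so total size never exceeds `budget`.

def trim_kv_cache(cache, keep_prefix, budget):
    """Evict middle entries once the cache exceeds `budget`."""
    if len(cache) <= budget:
        return cache  # under budget: nothing to evict
    tail = budget - keep_prefix  # how many recent entries to keep
    return cache[:keep_prefix] + cache[-tail:]

# Example: 10 cached positions, budget of 6, prefix of 2.
cache = [(i, i) for i in range(10)]
trimmed = trim_kv_cache(cache, keep_prefix=2, budget=6)
# Retains positions 0-1 (prefix) and 6-9 (most recent).
```

Bounding the cache this way keeps per-step decoding cost roughly constant as the stream grows, which is what enables the reported computation-aware latency savings.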
Siqi Ouyang
PhD Student, Language Technologies Institute, Carnegie Mellon University
Speech Translation · Large Language Model
Xi Xu
Carnegie Mellon University, Language Technologies Institute
Lei Li
Carnegie Mellon University, Language Technologies Institute