Efficient and Adaptive Simultaneous Speech Translation with Fully Unidirectional Architecture

📅 2025-04-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational redundancy of bidirectional encoders and the poor adaptability of fixed read/write policies in simultaneous speech translation (SimulST), this paper proposes the first fully unidirectional SimulST architecture. It employs a unidirectional Transformer-based speech encoder to eliminate redundant encoding computation; formulates translation as an interleaved generation task with explicit read/write action labels; introduces a lightweight dynamic policy head for adaptive streaming decisions; and trains the system jointly via multi-delay knowledge distillation and cross-modal alignment. Evaluated on MuST-C En→De and En→Es, the method significantly outperforms strong baselines: it improves BLEU by up to 1.8 points at equivalent latency, reduces encoder computation by 37%, and improves streaming robustness and real-time responsiveness, achieving a superior latency–quality trade-off.

📝 Abstract
Simultaneous speech translation (SimulST) produces translations incrementally while processing partial speech input. Although large language models (LLMs) have showcased strong capabilities in offline translation tasks, applying them to SimulST poses notable challenges. Existing LLM-based SimulST approaches either incur significant computational overhead due to repeated encoding with a bidirectional speech encoder, or they depend on a fixed read/write policy, limiting efficiency and performance. In this work, we introduce Efficient and Adaptive Simultaneous Speech Translation (EASiST) with a fully unidirectional architecture, including both the speech encoder and the LLM. EASiST includes a multi-latency data curation strategy to generate semantically aligned SimulST training samples and redefines SimulST as an interleaved generation task with explicit read/write tokens. To facilitate adaptive inference, we incorporate a lightweight policy head that dynamically predicts read/write actions. Additionally, we employ a multi-stage training strategy to align speech-text modalities and optimize both translation and policy behavior. Experiments on the MuST-C En→De and En→Es datasets demonstrate that EASiST offers superior latency-quality trade-offs compared to several strong baselines.
Problem

Research questions and friction points this paper is trying to address.

Apply LLMs to simultaneous speech translation efficiently
Overcome computational overhead of bidirectional speech encoding
Adaptive inference with dynamic read/write policy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fully unidirectional architecture for SimulST
Multi-latency data curation strategy
Lightweight policy head for adaptive inference
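The interleaved read/write formulation above can be illustrated with a toy sketch. The paper does not publish its implementation details, so everything here is a hypothetical stand-in: the policy head is modeled as a single linear layer plus sigmoid over a decoder state (random weights), the state update and decoder step are placeholders, and the dimensions and threshold are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 16  # hypothetical hidden size; not from the paper

# Lightweight policy head: one linear layer + sigmoid mapping the current
# hidden state to p(WRITE). Weights are random stand-ins, not trained values.
W = rng.normal(scale=0.5, size=(HIDDEN,))
b = 0.0

def policy_head(h):
    """Probability of taking a WRITE action given hidden state h."""
    return 1.0 / (1.0 + np.exp(-(h @ W + b)))

def decode_step(h):
    """Toy decoder step: emit a placeholder token and damp the state."""
    return f"tok{rng.integers(100)}", h - 0.3 * np.abs(h)

def simulst_loop(speech_chunks, threshold=0.5, max_tokens=8):
    """Interleaved generation: READ one speech chunk, then keep taking
    WRITE actions while the policy head's p(WRITE) clears the threshold."""
    actions, tokens = [], []
    h = np.zeros(HIDDEN)
    for chunk in speech_chunks:
        actions.append("READ")
        h = 0.5 * h + 0.5 * chunk  # toy incremental state update
        while policy_head(h) >= threshold and len(tokens) < max_tokens:
            actions.append("WRITE")
            tok, h = decode_step(h)
            tokens.append(tok)
    return actions, tokens

chunks = [rng.normal(size=HIDDEN) for _ in range(4)]
actions, tokens = simulst_loop(chunks)
print(actions)
```

The point of the sketch is the control flow, not the model: because the policy decision is a cheap head on top of states the decoder computes anyway, the read/write schedule can adapt per input instead of following a fixed wait-k pattern.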
Biao Fu
Xiamen University
LLMs · Reasoning · Machine Translation
Donglei Yu
Institute of Automation, Chinese Academy of Sciences
simultaneous machine translation · large language model
Minpeng Liao
Alibaba Group Tongyi Lab
Chengxi Li
Alibaba Group Tongyi Lab
Yidong Chen
School of Informatics, Xiamen University; Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan (Xiamen University), Ministry of Culture and Tourism
Kai Fan
ByteDance
Machine learning · Bayesian Deep Learning · Machine translation · LLMs
Xiaodong Shi
Xiamen University
natural language processing