🤖 AI Summary
To address the high computational redundancy of bidirectional encoders and the poor adaptability of fixed read/write policies in simultaneous speech translation (SimulST), this paper proposes the first fully unidirectional SimulST architecture. It employs a unidirectional Transformer-based speech encoder to eliminate redundant re-encoding; formulates translation as an interleaved generation task with explicit read/write action labels; introduces a lightweight dynamic policy head for adaptive streaming decisions; and applies joint training via multi-latency knowledge distillation and cross-modal alignment. Evaluated on MuST-C En→De/Es, the method significantly outperforms strong baselines: it improves BLEU by up to 1.8 points at equivalent latency, reduces encoder computation by 37%, and improves streaming robustness and responsiveness, achieving a superior latency–quality trade-off.
📝 Abstract
Simultaneous speech translation (SimulST) produces translations incrementally while processing partial speech input. Although large language models (LLMs) have showcased strong capabilities in offline translation tasks, applying them to SimulST poses notable challenges. Existing LLM-based SimulST approaches either incur significant computational overhead due to repeated encoding by a bidirectional speech encoder, or depend on a fixed read/write policy, limiting efficiency and performance. In this work, we introduce Efficient and Adaptive Simultaneous Speech Translation (EASiST), a fully unidirectional architecture comprising both the speech encoder and the LLM. EASiST includes a multi-latency data curation strategy to generate semantically aligned SimulST training samples and redefines SimulST as an interleaved generation task with explicit read/write tokens. To facilitate adaptive inference, we incorporate a lightweight policy head that dynamically predicts read/write actions. Additionally, we employ a multi-stage training strategy to align the speech and text modalities and to optimize both translation and policy behavior. Experiments on the MuST-C En$\rightarrow$De and En$\rightarrow$Es datasets demonstrate that EASiST offers superior latency–quality trade-offs compared to several strong baselines.
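The interleaved generation scheme described above, where a policy head alternates READ actions (consume more speech) with WRITE actions (emit target tokens), can be sketched as a simple decoding loop. This is a minimal illustration under stated assumptions, not the paper's actual implementation: `policy_head`, `translator`, and the chunk/token representations are all hypothetical stand-ins.

```python
# Hypothetical sketch of adaptive read/write SimulST inference.
# A unidirectional encoder lets us append-encode each new speech chunk
# without re-encoding the prefix; the policy head decides when to read
# more input versus write the next translation token.

def simulst_decode(speech_chunks, policy_head, translator, max_writes=100):
    """Interleave READ (consume a speech chunk) and WRITE (emit a token)
    actions until input is exhausted and the translation is complete."""
    encoded = []           # grow-only encoder states (no bidirectional re-encoding)
    output = []            # emitted target tokens
    chunk_iter = iter(speech_chunks)
    while len(output) < max_writes:
        action = policy_head(encoded, output)   # dynamic decision: "READ" or "WRITE"
        if action == "READ":
            chunk = next(chunk_iter, None)
            if chunk is None:
                action = "WRITE"                # input exhausted: must finish writing
            else:
                encoded.append(chunk)           # encode only the new chunk
                continue
        token = translator(encoded, output)     # next target token given partial input
        if token == "<eos>":
            break
        output.append(token)
    return output


# Toy policy and translator for illustration: read one chunk per written token.
chunks = ["c1", "c2", "c3"]

def toy_policy(enc, out):
    return "READ" if len(enc) <= len(out) else "WRITE"

def toy_translator(enc, out):
    return f"t{len(out) + 1}" if len(out) < len(enc) else "<eos>"

result = simulst_decode(chunks, toy_policy, toy_translator)
print(result)  # → ['t1', 't2', 't3']
```

The toy policy here is a fixed wait-1 rule for demonstration; the point of the paper's learned policy head is precisely to replace such fixed schedules with input-dependent decisions.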