🤖 AI Summary
This work addresses the inherent trade-off between low latency and high translation quality in conventional simultaneous speech translation systems, which typically rely on offline models coupled with handcrafted read/write policies. To overcome this limitation, the authors propose Hikari, an end-to-end model that unifies simultaneous speech translation and streaming transcription by modeling read/write decisions as probabilistic WAIT tokens, thereby eliminating the need for explicit policy design. The approach integrates a causal alignment architecture, decoder time dilation, and delay-robustness-oriented supervised fine-tuning. Evaluated on English-to-Japanese, English-to-German, and English-to-Russian tasks, Hikari achieves new state-of-the-art BLEU scores across both low- and high-latency scenarios, significantly outperforming existing baselines.
📝 Abstract
Simultaneous machine translation (SiMT) has traditionally relied on offline machine translation models coupled with human-engineered heuristics or learned policies. We propose Hikari, a policy-free, fully end-to-end model that performs simultaneous speech-to-text translation and streaming transcription by encoding READ/WRITE decisions into a probabilistic WAIT token mechanism. We also introduce Decoder Time Dilation, a mechanism that reduces autoregressive overhead and ensures a balanced training distribution. Additionally, we present a supervised fine-tuning strategy that trains the model to recover from delays, significantly improving the quality-latency trade-off. Evaluated on English-to-Japanese, English-to-German, and English-to-Russian, Hikari achieves new state-of-the-art BLEU scores in both low- and high-latency regimes, outperforming recent baselines.
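The core idea, replacing an external read/write policy with a WAIT token the model itself emits, can be illustrated with a minimal sketch. Everything here is illustrative and assumed, not the paper's actual API: `toy_model`, `simultaneous_decode`, and the `<wait>` token name are stand-ins showing how emitting WAIT maps to a READ action (consume more source) while any other token is a WRITE action (commit target output).

```python
# Hypothetical sketch of WAIT-token-based simultaneous decoding.
# The real Hikari model is a trained end-to-end network; `toy_model`
# below is a hard-coded stand-in that mimics a learned READ/WRITE policy.

WAIT = "<wait>"  # assumed token name, for illustration only


def toy_model(source_chunks_seen, target_so_far):
    """Stand-in for the model: returns the next token.

    Emits WAIT until it has seen at least one more source chunk than
    it has written target tokens (a wait-1-like behavior), then emits
    a placeholder target token.
    """
    if source_chunks_seen <= len(target_so_far):
        return WAIT
    return f"tok{len(target_so_far)}"


def simultaneous_decode(source_stream, max_target_len=5):
    """Interleave READ (consume source) and WRITE (emit target) actions
    based solely on the tokens the model emits -- no external policy."""
    target = []
    chunks_seen = 0
    stream = iter(source_stream)
    while len(target) < max_target_len:
        token = toy_model(chunks_seen, target)
        if token == WAIT:
            # READ: the WAIT token means "consume one more source chunk".
            next(stream, None)  # returns None once the source is exhausted
            chunks_seen += 1
        else:
            # WRITE: commit a target token to the output.
            target.append(token)
    return target


print(simultaneous_decode(["c0", "c1", "c2", "c3", "c4"]))
# Alternates READ/WRITE: ['tok0', 'tok1', 'tok2', 'tok3', 'tok4']
```

Because the READ/WRITE choice is just another token probability in the decoder's output distribution, latency behavior can in principle be shaped by training data alone, which is what makes the approach "policy-free."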