🤖 AI Summary
To address the high latency and computational cost that dialogue-based modeling and large language model (LLM) inference impose on read/write decisions in simultaneous speech-to-text translation (SimulST), this paper proposes a perception-driven streaming decision mechanism that mimics how human interpreters identify semantic units in continuous speech and trigger translation in real time. Departing from dialogue-style training, the method integrates lightweight speech perception and semantic boundary detection modules into an end-to-end trainable, low-overhead decision architecture. Evaluated on multiple SimulST benchmarks, it achieves a superior quality-latency trade-off, reducing average latency by 42% and accelerating decision-making by 9.6× over state-of-the-art methods, without sacrificing translation quality. The core contribution is the first integration of cognitively inspired, perception-driven decision-making into SimulST, enabling efficient, robust, and human-intuitive streaming translation.
📝 Abstract
How can simultaneous speech translation (SimulST) systems make human-interpreter-like read/write decisions? Current state-of-the-art systems formulate SimulST as a multi-turn dialogue task, requiring specialized interleaved training data and relying on computationally expensive large language model (LLM) inference for decision-making. In this paper, we propose SimulSense, a novel framework for SimulST that mimics human interpreters by continuously reading input speech and triggering write decisions to produce translation whenever a new sense unit is perceived. Experiments against two state-of-the-art baseline systems demonstrate that our proposed method achieves a superior quality-latency tradeoff and substantially improved real-time efficiency, with decision-making up to 9.6x faster than the baselines.
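The read/write policy described above can be illustrated with a minimal sketch. This is not the paper's implementation: `detect_sense_boundary` stands in for the perception/boundary-detection module and `translate` for the translation model; both are hypothetical placeholders used only to show the control flow of reading continuously and writing when a sense unit completes.

```python
def detect_sense_boundary(buffer):
    # Stand-in for the paper's sense-unit perception module:
    # here we simply pretend a sense unit completes every 3 chunks.
    return len(buffer) >= 3

def translate(buffer):
    # Stand-in for the incremental translation model.
    return f"<translation of {len(buffer)} chunks>"

def simulsense_policy(speech_chunks):
    """READ speech chunks until a sense unit is perceived, then WRITE."""
    buffer, outputs = [], []
    for chunk in speech_chunks:
        buffer.append(chunk)                    # READ: consume next chunk
        if detect_sense_boundary(buffer):
            outputs.append(translate(buffer))   # WRITE: emit translation
            buffer = []                         # start the next sense unit
    if buffer:                                  # flush a trailing partial unit
        outputs.append(translate(buffer))
    return outputs

print(simulsense_policy([f"chunk{i}" for i in range(7)]))
```

Under this toy boundary rule, seven input chunks yield two full-unit translations plus one flushed partial unit; the real system would instead make the boundary decision from learned speech perception, which is what removes the LLM from the decision loop.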