🤖 AI Summary
To address the semantic fragmentation that blind streaming chunking causes in simultaneous speech-to-text translation (SimulST), this paper proposes a syntax-aware dynamic chunking strategy. Methodologically, it leverages dependency parsing to identify semantically complete units, jointly optimizes translation timing and content via a frozen Whisper encoder coupled with a decoder-only large language model, and introduces target-side reordering to mitigate source-target word-order discrepancies; additionally, it employs a dual-mode output (<WAIT> or translation token) for fine-grained streaming control. Evaluated on the multilingual CoVoST2 benchmark, the approach achieves significant improvements in both BLEU score and latency metrics. It is the first work to empirically demonstrate that explicit syntactic structure modeling delivers critical gains for large-model-based SimulST systems. The proposed framework establishes a new paradigm for semantically coherent real-time translation.
📝 Abstract
This work proposes a grammar-based chunking strategy that segments input streams into semantically complete units by parsing dependency relations (e.g., noun-phrase boundaries, verb-object structures) and punctuation features. The method ensures chunk coherence and minimizes semantic fragmentation. Building on this mechanism, we present SASST (Syntax-Aware Simultaneous Speech Translation), an end-to-end framework integrating a frozen Whisper encoder and a decoder-only LLM. The unified architecture dynamically outputs translation tokens or <WAIT> symbols to jointly optimize translation timing and content, with target-side reordering addressing source-target word-order divergence. Experiments on the CoVoST2 multilingual corpus (En→{De, Zh, Ja}) demonstrate significant translation-quality improvements across languages and validate the effectiveness of syntactic structure in LLM-driven SimulST systems.
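The wait-or-translate mechanism driven by syntactic chunk boundaries can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: it assumes tokens arrive already annotated with dependency labels from an upstream parser (e.g. spaCy), and the boundary rule set here (`BOUNDARY_DEPS`) is a simplified placeholder chosen for the example.

```python
WAIT = "<WAIT>"

# Dependency labels treated as chunk-closing in this sketch: the token
# completes a coherent unit (direct object, prepositional object, or
# punctuation). The real system's boundary criteria may differ.
BOUNDARY_DEPS = {"dobj", "pobj", "punct"}

def stream_chunks(tagged_tokens):
    """Consume (word, dep_label) pairs one at a time.

    Yields ("<WAIT>", None) while the current chunk is still open,
    and ("TRANSLATE", chunk_text) once a boundary token closes it.
    """
    buffer = []
    for word, dep in tagged_tokens:
        buffer.append(word)
        if dep in BOUNDARY_DEPS:
            yield ("TRANSLATE", " ".join(buffer))
            buffer = []
        else:
            yield (WAIT, None)
    if buffer:  # flush any trailing partial chunk at end of stream
        yield ("TRANSLATE", " ".join(buffer))

# Hypothetical pre-parsed input stream for illustration.
tokens = [("the", "det"), ("cat", "nsubj"), ("chased", "root"),
          ("the", "det"), ("mouse", "dobj"), (".", "punct")]
actions = list(stream_chunks(tokens))
```

Here the model would emit <WAIT> for the first four tokens, then translate the complete verb-object chunk "the cat chased the mouse" at once, keeping each translation step aligned with a semantically complete source unit.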