SASST: Leveraging Syntax-Aware Chunking and LLMs for Simultaneous Speech Translation

📅 2025-08-11
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address the semantic fragmentation that blind streaming chunking causes in simultaneous speech-to-text translation (SimulST), this paper proposes a syntax-aware dynamic chunking strategy. Methodologically, it leverages dependency parsing to identify semantically complete units, jointly optimizes translation timing and content via a frozen Whisper encoder coupled with a decoder-only large language model, and introduces target-side reordering to mitigate source–target word-order divergence; a dual-mode output (<WAIT>/token) provides fine-grained streaming control. Evaluated on the multilingual CoVoST2 benchmark, the approach achieves significant improvements in both BLEU and latency metrics, and is presented as the first work to demonstrate empirically that explicit syntactic structure modeling delivers critical gains for LLM-based SimulST systems, establishing a framework for semantically coherent real-time translation.

πŸ“ Abstract
This work proposes a grammar-based chunking strategy that segments input streams into semantically complete units by parsing dependency relations (e.g., noun-phrase boundaries, verb–object structures) and punctuation features. The method ensures chunk coherence and minimizes semantic fragmentation. Building on this mechanism, we present SASST (Syntax-Aware Simultaneous Speech Translation), an end-to-end framework integrating a frozen Whisper encoder and a decoder-only LLM. The unified architecture dynamically outputs translation tokens or <WAIT> symbols to jointly optimize translation timing and content, with target-side reordering addressing word-order divergence. Experiments on the multilingual CoVoST2 corpus (En→{De, Zh, Ja}) demonstrate significant translation quality improvements across languages and validate the effectiveness of syntactic structures in LLM-driven SimulST systems.
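The dependency-based chunking idea in the abstract can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes the parser supplies one head index per token (roots pointing to themselves) and places a chunk boundary only where no dependency arc crosses, so each chunk is a self-contained unit.

```python
def chunk_boundaries(heads):
    """Return positions p such that a chunk boundary after token p
    crosses no dependency arc. heads[i] is the head index of token i;
    root tokens satisfy heads[i] == i (a self-loop, never a crossing)."""
    n = len(heads)
    boundaries = []
    for p in range(n - 1):
        # An arc i -> heads[i] crosses the boundary if it spans position p.
        crosses = any(
            min(i, heads[i]) <= p < max(i, heads[i])
            for i in range(n)
        )
        if not crosses:
            boundaries.append(p)
    return boundaries

# Toy parse of "She left ; he stayed":
# She->left, left=root, ;->left, he->stayed, stayed=root
print(chunk_boundaries([1, 1, 1, 4, 4]))  # [2]: split after ";"
```

A full pipeline would obtain the head indices from a streaming dependency parser; here they are hard-coded to keep the sketch self-contained.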
Problem

Research questions and friction points this paper is trying to address.

Enhancing simultaneous speech translation with syntax-aware chunking
Minimizing semantic fragmentation in real-time translation streams
Addressing word-order divergence in multilingual speech translation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Grammar-based chunking strategy for semantic units
End-to-end framework integrating Whisper and LLM
Dynamic output with translation tokens and reordering
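The dual-mode control listed above (emit either a translation token or a <WAIT> symbol) can be sketched as a simple read/write loop. `policy` is a hypothetical stand-in for the paper's LLM decoder, and `toy_policy` is a purely illustrative wait-style rule, not the learned model.

```python
WAIT = "<WAIT>"
EOS = "<EOS>"

def simulst_decode(source_chunks, policy):
    """policy(read, written) returns WAIT, EOS, or a target token.
    WAIT consumes the next source chunk; a token is appended to the
    output; decoding stops at EOS or when the source is exhausted
    while the policy still wants to read."""
    read, written = [], []
    pending = list(source_chunks)
    while True:
        action = policy(read, written)
        if action == EOS:
            break
        if action == WAIT:
            if pending:
                read.append(pending.pop(0))
            else:
                break  # nothing left to read
        else:
            written.append(action)
    return written

# Toy policy: stay one source chunk ahead of the output, stop after 3 tokens.
def toy_policy(read, written):
    if len(written) >= 3:
        return EOS
    if len(read) <= len(written):
        return WAIT
    return f"tok{len(written)}"

print(simulst_decode(["c0", "c1", "c2", "c3"], toy_policy))
# ['tok0', 'tok1', 'tok2']
```

In the paper, the read/emit decision and the token content are produced jointly by the decoder-only LLM; this sketch only separates them to make the control flow explicit.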
Zeyu Yang
The Chinese University of Hong Kong, Shenzhen, China
Lai Wei
The Chinese University of Hong Kong, Shenzhen, China
Roman Koshkin
Okinawa Institute of Science and Technology
artificial intelligence, computational neuroscience, simultaneous machine translation
Xi Chen
The Chinese University of Hong Kong, Shenzhen, China
Satoshi Nakamura
The Chinese University of Hong Kong, Shenzhen, China
speech and natural language processing