🤖 AI Summary
To address the semantic fragmentation that blind streaming chunking causes in simultaneous speech-to-text translation (SimulST), this paper proposes a syntax-aware dynamic chunking strategy. Methodologically, it leverages dependency parsing to identify semantically complete units, jointly optimizes translation timing and content via a frozen Whisper encoder coupled with a decoder-only large language model, and introduces target-side reordering to mitigate source-target word-order discrepancies; additionally, it employs a dual-mode output (<WAIT> or translation token) for fine-grained streaming control. Evaluated on the multilingual CoVoST2 benchmark, the approach achieves significant improvements in both BLEU score and latency metrics. It is the first work to empirically demonstrate that explicit syntactic structure modeling delivers critical gains for large-model-based SimulST systems. The proposed framework establishes a new paradigm for semantically coherent real-time translation.
📝 Abstract
This work proposes a grammar-based chunking strategy that segments input streams into semantically complete units by parsing dependency relations (e.g., noun-phrase boundaries, verb-object structures) and punctuation features. The method ensures chunk coherence and minimizes semantic fragmentation. Building on this mechanism, we present SASST (Syntax-Aware Simultaneous Speech Translation), an end-to-end framework integrating a frozen Whisper encoder and a decoder-only LLM. The unified architecture dynamically outputs translation tokens or <WAIT> symbols to jointly optimize translation timing and content, with target-side reordering addressing source-target word-order divergence. Experiments on the CoVoST2 multilingual corpus (En→{De, Zh, Ja}) demonstrate significant translation-quality improvements across languages and validate the effectiveness of syntactic structure in LLM-driven SimulST systems.
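The wait-or-translate mechanism driven by syntactic chunk boundaries can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: it assumes tokens arrive already annotated with dependency labels from an upstream parser (e.g. spaCy), and the boundary rule set here (`BOUNDARY_DEPS`) is a simplified placeholder chosen for the example.

```python
WAIT = "<WAIT>"

# Dependency labels treated as chunk-closing in this sketch: the token
# completes a coherent unit (direct object, prepositional object, or
# punctuation). The real system's boundary criteria may differ.
BOUNDARY_DEPS = {"dobj", "pobj", "punct"}

def stream_chunks(tagged_tokens):
    """Consume (word, dep_label) pairs one at a time.

    Yields ("<WAIT>", None) while the current chunk is still open,
    and ("TRANSLATE", chunk_text) once a boundary token closes it.
    """
    buffer = []
    for word, dep in tagged_tokens:
        buffer.append(word)
        if dep in BOUNDARY_DEPS:
            yield ("TRANSLATE", " ".join(buffer))
            buffer = []
        else:
            yield (WAIT, None)
    if buffer:  # flush any trailing partial chunk at end of stream
        yield ("TRANSLATE", " ".join(buffer))

# Hypothetical pre-parsed input stream for illustration.
tokens = [("the", "det"), ("cat", "nsubj"), ("chased", "root"),
          ("the", "det"), ("mouse", "dobj"), (".", "punct")]
actions = list(stream_chunks(tokens))
```

Here the model would emit <WAIT> for the first four tokens, then translate the complete verb-object chunk "the cat chased the mouse" at once, keeping each translation step aligned with a semantically complete source unit.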