RASST: Fast Cross-modal Retrieval-Augmented Simultaneous Speech Translation

πŸ“… 2026-01-30
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the challenge of low accuracy in simultaneous speech translation when handling rare and domain-specific terminology. The authors propose a tightly integrated cross-modal retrieval mechanism that dynamically supplies term-level hints during incremental translation via a lightweight speech-to-text retriever combined with a sliding-window strategy. These hints are adaptively fused into a retrieval-augmented speech large language model trained on synthetic data, enabling efficient integration and contextual decisions about when to use retrieved terms for partial inputs. To the best of the authors’ knowledge, this is the first approach to achieve real-time cross-modal term retrieval and adaptive fusion tailored to incremental input streams. Evaluated on the ACL 60/60 development set across three language directions, the method yields up to a 16% improvement in term translation accuracy and up to a 3-point gain in overall BLEU score.

πŸ“ Abstract
Simultaneous speech translation (SST) produces target text incrementally from partial speech input. Recent speech large language models (Speech LLMs) have substantially improved SST quality, yet they still struggle to correctly translate rare and domain-specific terminology. While retrieval augmentation has been effective for terminology translation in machine translation, bringing retrieval to SST is non-trivial: it requires fast and accurate cross-modal (speech-to-text) retrieval under partial, continually arriving input, and the model must decide whether and when to apply retrieved terms during incremental generation. We propose Retrieval-Augmented Simultaneous Speech Translation (RASST), which tightly integrates cross-modal retrieval into the SST pipeline. RASST trains a lightweight speech-text retriever and performs efficient sliding-window retrieval, providing chunkwise terminology hints to the Speech LLM. We further synthesize training data that teaches the Speech LLM to leverage retrieved terms precisely. Experiments on three language directions of the ACL 60/60 dev set show that RASST improves terminology translation accuracy by up to 16% and increases overall translation quality by up to 3 BLEU points, with ablations confirming the contribution of each component.
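The sliding-window retrieval step described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the mean-pooled window query, cosine similarity against a pre-normalized term-embedding matrix, and the specific window, stride, and threshold values are all assumptions.

```python
import numpy as np

def sliding_window_retrieve(speech_embeddings, term_embeddings, terms,
                            window=4, stride=2, top_k=2, threshold=0.6):
    """For each sliding window of speech-frame embeddings, pool a query
    vector and retrieve the most similar glossary terms as hints.

    speech_embeddings: (n_frames, d) array of frame embeddings.
    term_embeddings:   (n_terms, d) array, rows assumed L2-normalized.
    terms:             list of n_terms glossary strings.
    Returns a list of (window_start, [term hints]) pairs.
    """
    hints = []
    for start in range(0, max(1, len(speech_embeddings) - window + 1), stride):
        # Mean-pool the frames in the current window into one query vector.
        query = speech_embeddings[start:start + window].mean(axis=0)
        query = query / (np.linalg.norm(query) + 1e-8)
        # Cosine similarity reduces to a dot product with normalized rows.
        sims = term_embeddings @ query
        order = np.argsort(-sims)[:top_k]
        window_hints = [terms[i] for i in order if sims[i] >= threshold]
        hints.append((start, window_hints))
    return hints
```

In an actual SST pipeline the hints for the current chunk would be injected into the Speech LLM's context, and the model (trained on the synthetic data the paper describes) decides whether to apply them during incremental generation.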
Problem

Research questions and friction points this paper is trying to address.

simultaneous speech translation
retrieval augmentation
cross-modal retrieval
terminology translation
incremental generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

retrieval-augmented simultaneous speech translation
cross-modal retrieval
speech large language model
sliding-window retrieval