🤖 AI Summary
In speech translation, inaccurate translation of low-frequency phrases remains a critical challenge. To address this, we propose a dictionary-based logit bias augmentation method grounded in source–target phrase pair mappings. It is the first approach to integrate structured phrase dictionaries into speech translation bias mechanisms, and it adapts across architectures to both streaming speech translation models and multimodal large language models (MLLMs). Our method comprises three components: (1) construction of a phrase dictionary with dynamic logit bias injection, (2) joint modeling of streaming transcription and translation, and (3) integration of external phrase knowledge into MLLMs. Experiments demonstrate a 21% relative improvement over phrase list biasing for the streaming model and an 85% relative gain in phrase recall for MLLMs, improving the modeling of low-frequency phrases. This work establishes a novel paradigm for injecting external structured knowledge into speech translation systems.
📝 Abstract
Phrases are essential to understanding the core concepts in conversations. However, because they occur rarely in training data, translating them correctly is challenging in speech translation tasks. In this paper, we propose a phrase dictionary biasing method that leverages pairs of phrases mapped from the source language to the target language. We apply the method to two widely adopted types of models: a transducer-based streaming speech translation model and a multimodal large language model. Experimental results show that phrase dictionary biasing outperforms phrase list biasing by 21% relative for the streaming speech translation model. In addition, phrase dictionary biasing enables multimodal large language models to use external phrase information, achieving an 85% relative improvement in phrase recall.
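The abstract reports its gains as relative improvements, one of them in phrase recall. As a rough illustration of how such figures could be computed, here is a minimal Python sketch. The recall definition (case-insensitive substring matching of target phrases against each hypothesis), the helper names `phrase_recall` and `relative_improvement`, and the toy sentences are all assumptions made for illustration; they are not taken from the paper, which may define and measure recall differently.

```python
from typing import Sequence


def phrase_recall(
    hypotheses: Sequence[str],
    references: Sequence[str],
    target_phrases: Sequence[str],
) -> float:
    """Assumed definition: of all target phrases that occur in a reference
    translation, the fraction that also occur in the matching hypothesis."""
    expected = 0
    found = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_l, ref_l = hyp.lower(), ref.lower()
        for phrase in target_phrases:
            p = phrase.lower()
            if p in ref_l:          # phrase should have been produced
                expected += 1
                if p in hyp_l:      # and the system actually produced it
                    found += 1
    return found / expected if expected else 0.0


def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain of `improved` over `baseline`, e.g. 0.21 means 21%."""
    return (improved - baseline) / baseline


# Toy data (hypothetical, not from the paper).
references = [
    "the patient has atrial fibrillation",
    "we use a transducer model",
]
phrases = ["atrial fibrillation", "transducer model"]
baseline_hyps = [
    "the patient has a trial vibration",   # rare phrase mistranslated
    "we use a transducer model",
]
biased_hyps = [
    "the patient has atrial fibrillation",
    "we use a transducer model",
]

baseline_recall = phrase_recall(baseline_hyps, references, phrases)
biased_recall = phrase_recall(biased_hyps, references, phrases)
gain = relative_improvement(baseline_recall, biased_recall)
```

Under this reading, a reported 85% relative improvement would mean the biased system's recall is 1.85 times the baseline's, not that recall rose by 85 percentage points.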