How to Evaluate Speech Translation with Source-Aware Neural MT Metrics

📅 2025-11-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses a practical challenge in speech translation (ST) evaluation: textual transcriptions of the source speech are often unavailable. The authors propose the first systematic source-aware automatic evaluation framework for this scenario. The method comprises (1) a two-step cross-lingual re-segmentation algorithm that mitigates alignment mismatches between synthetic source proxies and reference translations, and (2) the use of either ASR transcripts or back-translated references as source surrogates for source-aware neural machine translation metrics. Experiments across 79 language pairs and six ST systems show that ASR transcripts are the more reliable proxy when the word error rate is below 20%, while back-translation offers comparable performance at lower cost. The proposed framework significantly improves correlation with human judgments (e.g., average Kendall's τ increases by 0.18), establishing a scalable evaluation paradigm for source-free ST assessment.
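The summary reports metric quality as correlation with human judgments via Kendall's τ. As a minimal illustration of how a segment-level τ between automatic metric scores and human ratings is computed (tau-a variant, pure Python; the scores below are made up for the example, not taken from the paper):

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total pairs."""
    assert len(xs) == len(ys)
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1    # pair ranked the same way by both
        elif s < 0:
            discordant += 1    # pair ranked oppositely
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical metric scores and human judgments for five segments.
metric_scores = [0.81, 0.42, 0.67, 0.90, 0.55]
human_scores  = [0.78, 0.35, 0.70, 0.88, 0.50]
print(kendall_tau(metric_scores, human_scores))  # → 1.0 (identical rankings)
```

In practice ties are common in human ratings, so evaluation campaigns typically use a tie-aware variant (tau-b or tau-c) rather than tau-a.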

📝 Abstract
Automatic evaluation of speech-to-text translation (ST) systems is typically performed by comparing translation hypotheses with one or more reference translations. While effective to some extent, this approach inherits the limitation of reference-based evaluation that ignores valuable information from the source input. In machine translation (MT), recent progress has shown that neural metrics incorporating the source text achieve stronger correlation with human judgments. Extending this idea to ST, however, is not trivial because the source is audio rather than text, and reliable transcripts or alignments between source and references are often unavailable. In this work, we conduct the first systematic study of source-aware metrics for ST, with a particular focus on real-world operating conditions where source transcripts are not available. We explore two complementary strategies for generating textual proxies of the input audio: automatic speech recognition (ASR) transcripts and back-translations of the reference translation. We also introduce a novel two-step cross-lingual re-segmentation algorithm to address the alignment mismatch between synthetic sources and reference translations. Our experiments, carried out on two ST benchmarks covering 79 language pairs and six ST systems with diverse architectures and performance levels, show that ASR transcripts constitute a more reliable synthetic source than back-translations when the word error rate is below 20%, while back-translations remain a computationally cheaper but still effective alternative. Furthermore, our cross-lingual re-segmentation algorithm enables robust use of source-aware MT metrics in ST evaluation, paving the way toward more accurate and principled evaluation methodologies for speech translation.
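The abstract conditions the reliability of ASR-based proxies on word error rate (WER) being below 20%. For reference, this is the standard WER definition: word-level edit distance between an ASR transcript and a gold transcript, divided by the number of reference words (a minimal sketch, not the paper's implementation):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edits to turn ref[:i] into hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# Toy example: the ASR output drops one word out of six.
asr  = "the cat sat on mat"
gold = "the cat sat on the mat"
print(f"WER = {wer(gold, asr):.1%}")  # → WER = 16.7%
```

Under the paper's finding, a transcript like this (WER ≈ 16.7% < 20%) would fall in the regime where ASR output is the preferable synthetic source.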
Problem

Research questions and friction points this paper is trying to address.

Developing source-aware metrics for speech translation without source transcripts
Addressing alignment mismatch between synthetic sources and reference translations
Evaluating speech translation when source audio lacks reliable text alignments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using ASR transcripts as synthetic source for metrics
Employing back-translations as alternative synthetic source
Introducing cross-lingual re-segmentation for alignment correction
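The paper's two-step cross-lingual re-segmentation algorithm is not reproduced here, but the general family it belongs to (edit-distance re-segmentation, in the spirit of tools like mwerSegmenter) can be sketched: given an unsegmented stream of words and a set of reference segments, find the split of the stream that minimizes the summed word-level edit distance between each chunk and its reference segment. The function names and the dynamic program below are illustrative assumptions, not the authors' code:

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists (one-row DP)."""
    d = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, wb in enumerate(b, 1):
            prev, d[j] = d[j], min(prev + (wa != wb), d[j] + 1, d[j - 1] + 1)
    return d[len(b)]

def resegment(stream_words, ref_segments):
    """Split stream_words into len(ref_segments) chunks minimizing the
    total edit distance of each chunk against its reference segment."""
    refs = [seg.split() for seg in ref_segments]
    n, m = len(stream_words), len(refs)
    INF = float("inf")
    # cost[s][i]: best cost of mapping the first i words to the first s segments.
    cost = [[INF] * (n + 1) for _ in range(m + 1)]
    back = [[0] * (n + 1) for _ in range(m + 1)]
    cost[0][0] = 0
    for s in range(1, m + 1):
        for i in range(n + 1):
            for k in range(i + 1):  # k = start of chunk for segment s
                if cost[s - 1][k] == INF:
                    continue
                c = cost[s - 1][k] + edit_distance(stream_words[k:i], refs[s - 1])
                if c < cost[s][i]:
                    cost[s][i], back[s][i] = c, k
    # Recover chunk boundaries from the backpointers.
    cuts, i = [], n
    for s in range(m, 0, -1):
        cuts.append((back[s][i], i))
        i = back[s][i]
    return [" ".join(stream_words[a:b]) for a, b in reversed(cuts)]

hyp = "hello world how are you today".split()
refs = ["hello world", "how are you today"]
print(resegment(hyp, refs))  # → ['hello world', 'how are you today']
```

The cross-lingual twist in the paper is that the stream and the references are in different languages, so a direct monolingual edit distance like the one above does not apply as-is; this sketch only conveys the underlying boundary-recovery idea.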