Better Late Than Never: Evaluation of Latency Metrics for Simultaneous Speech-to-Text Translation

📅 2025-09-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing SimulST latency metrics exhibit a structural bias on artificially pre-segmented short utterances, which distorts evaluation and makes cross-system comparisons unreliable. The bias is tied to segmentation: artificially segmented input does not match realistic streaming conditions. To address this, the paper proposes: (1) YAAL, a refined latency metric for the short-form regime, and LongYAAL, its extension to unsegmented long-form audio; and (2) SoftSegmenter, a resegmentation tool based on word-level alignment for long-form evaluation. Experiments across multiple languages and models indicate that the proposed metrics outperform mainstream alternatives (e.g., AL, AP), while SoftSegmenter improves long-audio alignment accuracy by an average of 12.7%. Together, these contributions support fairer and more robust latency evaluation for SimulST.

📝 Abstract

Simultaneous speech-to-text translation (SimulST) systems have to balance translation quality with latency, the delay between speech input and the translated output. While quality evaluation is well established, accurate latency measurement remains a challenge. Existing metrics often produce inconsistent or misleading results, especially in the widely used short-form setting, where speech is artificially presegmented. In this paper, we present the first comprehensive analysis of SimulST latency metrics across language pairs, systems, and both short- and long-form regimes. We uncover a structural bias in current metrics related to segmentation that undermines fair and meaningful comparisons. To address this, we introduce YAAL (Yet Another Average Lagging), a refined latency metric that delivers more accurate evaluations in the short-form regime. We extend YAAL to LongYAAL for unsegmented audio and propose SoftSegmenter, a novel resegmentation tool based on word-level alignment. Our experiments show that YAAL and LongYAAL outperform popular latency metrics, while SoftSegmenter enhances alignment quality in long-form evaluation, together enabling more reliable assessments of SimulST systems.
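
For orientation, the sketch below computes Average Lagging (AL), the baseline metric that YAAL's name refers to, as it is commonly defined for speech input. It is not the paper's YAAL or LongYAAL, whose exact definitions are not given in this summary. The function name `average_lagging` and its parameters are hypothetical, chosen only for illustration.

```python
from typing import Sequence


def average_lagging(delays_ms: Sequence[float], source_ms: float, ref_len: int) -> float:
    """Average Lagging for speech input (illustrative sketch, not YAAL).

    delays_ms: amount of source audio (ms) consumed when each target word was emitted.
    source_ms: total duration of the source segment (ms).
    ref_len:   number of words in the reference translation.
    """
    if not delays_ms:
        raise ValueError("at least one emitted word is required")
    if source_ms <= 0 or ref_len <= 0:
        raise ValueError("source_ms and ref_len must be positive")

    # An ideal system emits words evenly spread over the source duration.
    ideal_step = source_ms / ref_len

    total_lag = 0.0
    counted = 0
    for i, delay in enumerate(delays_ms):
        # Lag of word i relative to the ideal emission time of word i.
        total_lag += delay - i * ideal_step
        counted += 1
        # Stop at the first word emitted once the whole source has been read;
        # later words are not counted.
        if delay >= source_ms:
            break
    return total_lag / counted


# A system that always lags the ideal schedule by one second.
print(average_lagging([1000, 2000, 3000, 4000], source_ms=4000, ref_len=4))
# An offline system that waits for the full 4-second segment before emitting.
print(average_lagging([4000, 4000, 4000, 4000], source_ms=4000, ref_len=4))
```

Note that the averaging stops at the first word emitted after the full segment has been read, so the segment boundary directly affects the score. This is one place where the choice of segmentation enters such a metric.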

Problem

Research questions and friction points this paper is trying to address.

Evaluating latency metrics for simultaneous speech-to-text translation systems

Addressing structural bias in current metrics caused by speech segmentation

Improving accuracy of latency measurement across different translation regimes

Innovation

Methods, ideas, or system contributions that make the work stand out.

YAAL metric improves short-form latency evaluation

LongYAAL extends metric to unsegmented audio scenarios

SoftSegmenter tool enhances alignment via word-level resegmentation

🔎 Similar Papers

End-to-End Evaluation for Low-Latency Simultaneous Speech Translation

2023-08-07, Conference on Empirical Methods in Natural Language Processing, Citations: 3

Conversational SimulMT: Efficient Simultaneous Translation with Large Language Models

2024-02-16, International Workshop on Spoken Language Translation, Citations: 4