🤖 AI Summary
Existing latency metrics for simultaneous speech-to-text translation (SimulST) exhibit a structural bias on artificially pre-segmented short utterances, which distorts evaluation and makes cross-system comparisons unreliable. This bias stems from a mismatch between artificial segmentation and realistic streaming conditions. To address this, we propose: (1) YAAL and LongYAAL, word-level alignment-driven latency metrics tailored to short and long audio scenarios, respectively, which eliminate segmentation-induced bias; and (2) SoftSegmenter, a resegmentation tool that optimizes long-audio chunking via speech-text alignment. Experiments across multiple languages and models demonstrate that our metrics significantly outperform mainstream alternatives (e.g., AL, AP), while SoftSegmenter improves long-audio alignment accuracy by an average of 12.7%. This work establishes a fairer, more robust, and scenario-adaptive latency evaluation paradigm for SimulST.
📝 Abstract
Simultaneous speech-to-text translation (SimulST) systems have to balance translation quality with latency, the delay between speech input and the translated output. While quality evaluation is well established, accurate latency measurement remains a challenge. Existing metrics often produce inconsistent or misleading results, especially in the widely used short-form setting, where speech is artificially pre-segmented. In this paper, we present the first comprehensive analysis of SimulST latency metrics across language pairs, systems, and both short- and long-form regimes. We uncover a structural bias in current metrics related to segmentation that undermines fair and meaningful comparisons. To address this, we introduce YAAL (Yet Another Average Lagging), a refined latency metric that delivers more accurate evaluations in the short-form regime. We extend YAAL to LongYAAL for unsegmented audio and propose SoftSegmenter, a novel resegmentation tool based on word-level alignment. Our experiments show that YAAL and LongYAAL outperform popular latency metrics, while SoftSegmenter enhances alignment quality in long-form evaluation, together enabling more reliable assessments of SimulST systems.
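For readers unfamiliar with the metric family that YAAL takes its name from, the classical Average Lagging (AL) score can be sketched as below. This follows the original text-to-text definition of Ma et al. (2019), which the abstract does not spell out. It is background only and is not YAAL or LongYAAL, whose definitions are not given here. The function name `average_lagging` and its parameters are illustrative assumptions.

```python
from typing import Sequence


def average_lagging(delays: Sequence[float], source_length: float) -> float:
    """Classical Average Lagging for a single utterance (illustrative sketch).

    delays[i] is how much source had been consumed (tokens, or ms of audio)
    when target token i was emitted; source_length is the total source amount.
    Assumption: the ideal rate uses the hypothesis length len(delays).
    """
    if not delays:
        raise ValueError("delays must contain at least one target token")
    if source_length <= 0:
        raise ValueError("source_length must be positive")

    # An ideal, zero-lag translator would consume this much source per target token.
    ideal_step = source_length / len(delays)

    total_lag = 0.0
    counted = 0
    for i, delay in enumerate(delays):
        # Lag of token i relative to the ideal policy.
        total_lag += delay - i * ideal_step
        counted += 1
        # Stop at the first token emitted after the full source was read:
        # later tokens carry no information about streaming behaviour.
        if delay >= source_length:
            break
    return total_lag / counted


# A wait-2 policy on a 4-token source and 4-token target lags by 2 tokens.
print(average_lagging([2, 3, 4, 4], source_length=4))  # 2.0
# An offline system that reads everything first lags by the full source length.
print(average_lagging([4, 4, 4, 4], source_length=4))  # 4.0
```

The score is computed per utterance, so it depends on where utterance boundaries are placed. Speech adaptations of AL typically measure delays in milliseconds, and some use the reference length rather than the hypothesis length for the ideal rate. The sketch keeps the simpler hypothesis-length form.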