AI Summary
The existing SimulEval framework is no longer maintained and lacks support for output revision, long audio stream processing, and real-time demonstration. Method: We propose simulstream, the first open-source evaluation and demonstration framework tailored to long audio streams in streaming speech-to-text translation (StreamST). It introduces a novel evaluation paradigm compatible with retranslation mechanisms, enabling unified assessment of incremental-decoding and output-revision systems. We design a lightweight streaming inference interface and modules for computing latency metrics such as Average Lagging (AL) and Differentiable Average Lagging (DAL). Additionally, we integrate an interactive web visualization interface powered by Gradio and WebSockets. Contribution/Results: simulstream improves evaluation reproducibility and fairness in model comparison. The framework is fully open-sourced and has been adopted by the research community.
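The latency metrics named above are computed from the emission delays of each target token. As a minimal sketch, here is the standard Average Lagging definition from the simultaneous-translation literature (Ma et al., 2019); this illustrates the metric itself, not necessarily the framework's exact implementation:

```python
def average_lagging(delays, src_len, tgt_len):
    """Average Lagging (AL).

    delays[i] = number of source units read when target token i+1 was emitted.
    The oracle reads src_len/tgt_len source units per emitted target token;
    AL averages how far the system lags behind that oracle, up to the first
    token emitted after the full source has been consumed.
    """
    # tau: 1-based index of the first target token emitted once the whole
    # source has been read (falls back to the last token if never reached).
    tau = next((i for i, d in enumerate(delays, start=1) if d >= src_len),
               len(delays))
    rate = src_len / tgt_len  # oracle reading rate
    return sum(delays[i - 1] - (i - 1) * rate for i in range(1, tau + 1)) / tau
```

For example, a system that emits one token after each source unit on a length-4 pair lags by exactly one unit (AL = 1.0), while a system that waits for the full source before emitting anything gets AL = 4.0.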
Abstract
Streaming Speech-to-Text Translation (StreamST) requires producing translations concurrently with incoming speech, imposing strict latency constraints and demanding models that balance partial-information decision-making with high translation quality. Research efforts on the topic have so far relied on the SimulEval repository, which is no longer maintained and does not support systems that revise their outputs. In addition, it has been designed for simulating the processing of short segments, rather than long-form audio streams, and it does not provide an easy method to showcase systems in a demo. As a solution, we introduce simulstream, the first open-source framework dedicated to unified evaluation and demonstration of StreamST systems. Designed for long-form speech processing, it supports not only incremental decoding approaches but also re-translation methods, enabling their comparison within the same framework in terms of both quality and latency. In addition, it offers an interactive web interface to demo any system built within the tool.
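The unified treatment of incremental decoding and re-translation can be illustrated with a small hypothetical interface: each inference step emits either tokens to append or a fully revised hypothesis. All names and types below are illustrative assumptions, not simulstream's actual API:

```python
from dataclasses import dataclass

@dataclass
class StepOutput:
    text: str          # text produced at this step
    is_revision: bool  # True: replace the whole hypothesis (re-translation);
                       # False: append to it (incremental decoding)

def update_hypothesis(current: list, out: StepOutput) -> list:
    # Re-translation systems may rewrite earlier output, while incremental
    # systems only ever extend it; both reduce to a single update rule,
    # which is what makes evaluating them in one framework possible.
    return out.text.split() if out.is_revision else current + out.text.split()

# Incremental step: new tokens are appended to the running hypothesis.
hyp = update_hypothesis(["hello"], StepOutput("world", is_revision=False))
# Re-translation step: the full hypothesis is replaced by a revised one.
hyp = update_hypothesis(hyp, StepOutput("hello there world", is_revision=True))
```

A latency metric can then be charged for revised spans as well as appended ones, which is why a retranslation-compatible evaluation paradigm matters for fair comparison.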