Audio-Based Crowd-Sourced Evaluation of Machine Translation Quality

📅 2025-09-17
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Current machine translation (MT) quality evaluation heavily relies on text-centric paradigms, limiting its applicability to real-world speech translation scenarios such as simultaneous interpreting. This paper presents the first systematic investigation into the consistency and sensitivity of incorporating the speech modality into MT evaluation. Through a crowdsourced study on Amazon Mechanical Turk, we compare human ratings of 10 MT systems under both audio-only and text-only conditions. Results demonstrate strong overall agreement between speech-based and text-based evaluations, while speech evaluation additionally reveals statistically significant quality distinctions among systems that remain undetected in text-only assessment. We advocate integrating speech-based evaluation into standard MT benchmarking frameworks, thereby advancing a more natural, context-aware evaluation paradigm. This work provides a novel methodological foundation for developing and optimizing speech translation systems.
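
The agreement between the two conditions can be made concrete with a rank-correlation check over system-level scores. The following is a minimal sketch, not the authors' code, and all scores are invented placeholders for the 10 evaluated systems:

```python
# Minimal sketch: quantify agreement between audio-based and text-based
# system rankings with rank correlation. All scores are hypothetical
# placeholders, not data from the paper.
from scipy.stats import kendalltau, spearmanr

# Mean human rating per MT system under each evaluation condition.
audio_scores = [78.2, 74.5, 73.9, 71.0, 70.4, 69.8, 66.3, 64.1, 61.7, 58.9]
text_scores = [77.5, 75.1, 72.8, 71.9, 69.6, 70.2, 65.8, 64.9, 60.3, 59.4]

tau, tau_p = kendalltau(audio_scores, text_scores)
rho, rho_p = spearmanr(audio_scores, text_scores)
print(f"Kendall tau = {tau:.3f} (p = {tau_p:.4f})")
print(f"Spearman rho = {rho:.3f} (p = {rho_p:.4f})")
```

A high correlation here would mirror the paper's finding of strong overall agreement between the speech-based and text-based rankings.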

๐Ÿ“ Abstract
Machine Translation (MT) has achieved remarkable performance, with growing interest in speech translation and multimodal approaches. However, despite these advancements, MT quality assessment remains largely text-centric, typically relying on human experts who read and compare texts. Since many real-world MT applications (e.g., Google Translate Voice Mode, iFLYTEK Translator) involve translations being spoken rather than printed or read, a more natural way to assess translation quality would be through speech rather than text-only evaluation. This study compares text-only and audio-based evaluations of 10 MT systems from the WMT General MT Shared Task, using crowd-sourced judgments collected via Amazon Mechanical Turk. We additionally performed statistical significance testing and self-replication experiments to test the reliability and consistency of the audio-based approach. Crowd-sourced assessments based on audio yield rankings largely consistent with text-only evaluations but, in some cases, identify significant differences between translation systems. We attribute this to speech being a richer, more natural modality and propose incorporating speech-based assessments into future MT evaluation frameworks.
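
The abstract's significance testing between system pairs can be sketched as follows. A non-parametric test over per-segment ratings is a common choice in WMT-style human evaluation; the paper does not name its exact test here, so the test choice and the simulated ratings below are both assumptions:

```python
# Hedged sketch of pairwise significance testing between two MT systems.
# The test (Mann-Whitney U) and the simulated ratings are our assumptions;
# the paper's exact procedure may differ.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# Simulated 0-100 quality ratings for 200 segments from each system.
ratings_sys_a = rng.normal(72, 12, size=200).clip(0, 100)
ratings_sys_b = rng.normal(68, 12, size=200).clip(0, 100)

stat, p = mannwhitneyu(ratings_sys_a, ratings_sys_b, alternative="two-sided")
print(f"Mann-Whitney U = {stat:.1f}, p = {p:.4f}")
print("significant at alpha=0.05" if p < 0.05 else "not significant")
```

A test like this, run per system pair and per condition, is what would reveal distinctions detectable under audio evaluation but not under text-only evaluation.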
Problem

Research questions and friction points this paper is trying to address.

Evaluating machine translation quality through audio instead of text
Comparing crowd-sourced audio and text-based MT assessments
Testing reliability of speech-based translation evaluation methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Audio-based crowd-sourced MT evaluation
Statistical testing for reliability verification (see the sketch below)
Speech modality integration for assessment
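
To illustrate the reliability check referenced above: under a self-replication design, system-level scores from two independent crowd-sourcing runs of the same audio-based protocol should correlate strongly. A hypothetical sketch, with all numbers invented:

```python
# Hypothetical self-replication check: correlate system-level scores from
# two independent crowd-sourcing runs of the audio-based evaluation.
# All scores are invented for illustration.
from scipy.stats import pearsonr

run1 = [78.2, 74.5, 73.9, 71.0, 70.4, 69.8, 66.3, 64.1, 61.7, 58.9]
run2 = [77.0, 75.2, 72.5, 71.8, 69.9, 70.5, 67.1, 63.4, 62.2, 58.0]

r, p = pearsonr(run1, run2)
print(f"Pearson r between replication runs = {r:.3f} (p = {p:.4f})")
```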