🤖 AI Summary
This study addresses the challenge of evaluating automatic speech processing and translation systems in cross-lingual meetings where participants share no common language. To this end, we introduce a 5-hour multilingual meeting speech corpus with human-annotated misunderstandings, covering 12 source languages. The corpus includes ASR and gold transcripts, automatic and human-corrected English translations, written meeting minutes, and fine-grained misunderstanding annotations. We frame cross-lingual misunderstanding detection as a quantifiable NLP task: misunderstandings are annotated manually, and large language models are then tested on locating them automatically. The Gemini model identifies text spans containing misunderstandings with 77% recall and 47% precision. The corpus supports benchmarking across multiple tasks, including ASR, machine translation, cross-lingual summarization, and misunderstanding detection. All multi-level annotations are publicly released, providing an evaluation resource for cross-lingual, machine-mediated communication.
📝 Abstract
Speech processing and translation technologies have the potential to facilitate meetings of individuals who do not share any common language. To evaluate automatic systems for such a task, a versatile and realistic evaluation corpus is needed. Therefore, we create and present a corpus of cross-lingual dialogues between individuals without a common language, whose conversations were facilitated by automatic simultaneous speech translation. The corpus consists of 5 hours of speech recordings with ASR and gold transcripts in 12 original languages, together with automatic and corrected translations into English. To support research into cross-lingual summarization, our corpus also includes written summaries (minutes) of the meetings.
Moreover, we propose the task of automatically detecting misunderstandings. To give an overview of this task and its complexity, we attempt to quantify misunderstandings in cross-lingual meetings. We annotate misunderstandings manually and also test the ability of current large language models to detect them automatically. The results show that the Gemini model is able to identify text spans containing misunderstandings with a recall of 77% and a precision of 47%.
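To make the reported span-level figures concrete, the following minimal sketch shows one way precision and recall over text spans could be computed. The matching criterion (any overlap between a predicted and a gold span counts as a hit), the half-open offset representation, and the names `overlaps` and `span_precision_recall` are assumptions made for illustration; the abstract does not specify how spans were matched.

```python
from typing import List, Tuple

# Assumption: a span is a half-open [start, end) range of token offsets.
Span = Tuple[int, int]


def overlaps(a: Span, b: Span) -> bool:
    """Return True if the two half-open spans share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def span_precision_recall(predicted: List[Span], gold: List[Span]) -> Tuple[float, float]:
    """Overlap-based span precision and recall (hypothetical matching rule).

    Precision: share of predicted spans that overlap some gold span.
    Recall: share of gold spans that are overlapped by some predicted span.
    """
    hit_pred = sum(1 for p in predicted if any(overlaps(p, g) for g in gold))
    hit_gold = sum(1 for g in gold if any(overlaps(g, p) for p in predicted))
    precision = hit_pred / len(predicted) if predicted else 0.0
    recall = hit_gold / len(gold) if gold else 0.0
    return precision, recall


# Toy data, not taken from the corpus.
gold_spans = [(0, 5), (10, 15), (20, 25), (30, 35)]
model_spans = [(3, 7), (12, 13), (22, 30), (40, 45), (50, 55), (60, 62)]
precision, recall = span_precision_recall(model_spans, gold_spans)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case the model flags six spans, three of which touch a gold annotation, and three of the four gold spans are found. A profile with high recall and moderate precision, like the one reported for Gemini, means that most annotated misunderstandings are flagged while a sizeable share of the flagged spans are false alarms.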