Corpus of Cross-lingual Dialogues with Minutes and Detection of Misunderstandings

📅 2025-12-23
🤖 AI Summary
This study addresses the challenge of evaluating automatic speech processing and translation systems in cross-lingual meetings where participants share no common language. It introduces a 5-hour corpus of meeting speech in 12 original languages, recorded while automatic simultaneous speech translation mediated the conversation. The corpus includes ASR and gold transcripts, automatic and human-corrected English translations, written meeting minutes, and manual annotations of misunderstandings. The authors propose automatic detection of misunderstandings as a quantifiable NLP task and test whether current large language models can perform it: the Gemini model identifies text spans containing misunderstandings with 77% recall and 47% precision. The corpus supports benchmarking across multiple tasks, including ASR, machine translation, cross-lingual summarization, and misunderstanding detection. The annotations are publicly released as an evaluation resource for cross-lingual communication.

📝 Abstract
Speech processing and translation technology have the potential to facilitate meetings of individuals who do not share any common language. To evaluate automatic systems for such a task, a versatile and realistic evaluation corpus is needed. Therefore, we create and present a corpus of cross-lingual dialogues between individuals without a common language, whose conversations were facilitated by automatic simultaneous speech translation. The corpus consists of 5 hours of speech recordings with ASR and gold transcripts in 12 original languages, along with automatic and corrected translations into English. For the purposes of research into cross-lingual summarization, our corpus also includes written summaries (minutes) of the meetings. Moreover, we propose automatic detection of misunderstandings. To give an overview of this task and its complexity, we attempt to quantify misunderstandings in cross-lingual meetings. We annotate misunderstandings manually and also test the ability of current large language models to detect them automatically. The results show that the Gemini model is able to identify text spans containing misunderstandings with a recall of 77% and a precision of 47%.
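To make the reported recall and precision concrete, the sketch below shows one common way span-level detection can be scored. It is an illustration only: the abstract does not say how predicted spans are matched to annotated ones, so the overlap-based matching, the function names, and the example offsets are all assumptions rather than the authors' evaluation procedure.

```python
from typing import List, Tuple

# A span is a pair of character offsets (start, end), end exclusive.
Span = Tuple[int, int]


def overlaps(a: Span, b: Span) -> bool:
    """Return True if the two spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def span_precision_recall(predicted: List[Span], gold: List[Span]) -> Tuple[float, float]:
    """Score predicted spans against gold spans.

    Assumption: a predicted span counts as correct if it overlaps any gold
    span, and a gold span counts as found if any predicted span overlaps it.
    """
    correct_pred = sum(1 for p in predicted if any(overlaps(p, g) for g in gold))
    found_gold = sum(1 for g in gold if any(overlaps(g, p) for p in predicted))
    precision = correct_pred / len(predicted) if predicted else 0.0
    recall = found_gold / len(gold) if gold else 0.0
    return precision, recall


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Hypothetical offsets, not taken from the corpus.
gold_spans = [(0, 10), (20, 30), (50, 60)]
predicted_spans = [(5, 12), (22, 25), (40, 45), (70, 80)]

precision, recall = span_precision_recall(predicted_spans, gold_spans)
print(precision, recall)

# F1 implied by the figures in the abstract, assuming the standard definitions.
print(round(f1_score(0.47, 0.77), 2))
```

Under the standard definitions, a recall of 77% with a precision of 47% corresponds to an F1 of roughly 0.58: the model finds most annotated misunderstandings, but about half of the spans it flags are not annotated as such.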
Problem

Research questions and friction points this paper is trying to address.

Creating a cross-lingual dialogue corpus for evaluating speech translation systems
Proposing automatic detection of misunderstandings in cross-lingual meetings
Quantifying and annotating misunderstandings to test large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Created cross-lingual dialogue corpus with speech translation
Proposed automatic detection of misunderstandings in meetings
Evaluated large language models for misunderstanding identification
Marko Čechovič
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (ÚFAL), Prague, Czechia
Natália Komorníková
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (ÚFAL), Prague, Czechia
Dominik Macháček
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (ÚFAL), Prague, Czechia
Ondřej Bojar
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics
machine translation, speech translation, parsing, treebanking