🤖 AI Summary
This study addresses the scarcity of unified, high-quality resources for comparing written and spoken translation modes, disfluencies, and features of translated language from an information-theoretic perspective. It combines the EPIC-UdS (interpreting) and EuroParl-UdS (translation) corpora, corrects metadata and textual errors, updates the linguistic annotation, and adds two new layers: word alignment and word-level surprisal indices. Using probabilistic outputs from base and fine-tuned GPT-2 and machine translation models, the work builds a multi-layered corpus resource and evaluates these measures on the task of predicting filler particles in interpreting. An illustrative study validates the integrity of the rebuilt spoken data and assesses how well surprisal predicts fillers, providing evidence relevant to research on the cognitive mechanisms underlying translation and interpreting.
📝 Abstract
This paper introduces an updated and combined version of the bidirectional English-German EPIC-UdS (spoken) and EuroParl-UdS (written) corpora containing original European Parliament speeches as well as their translations and interpretations. The new version corrects metadata and text errors identified through previous use, refines the content, updates linguistic annotations, and adds new layers, including word alignment and word-level surprisal indices. The combined resource is designed to support research using information-theoretic approaches to language variation, particularly studies that compare written and spoken modes or examine disfluencies in speech, as well as traditional translationese studies, including parallel (source vs. target) and comparable (original vs. translated) analyses. The paper outlines the updates introduced in this release, summarises previous results based on the corpus, and presents a new illustrative study. The study validates the integrity of the rebuilt spoken data and evaluates probabilistic measures derived from base and fine-tuned GPT-2 and machine translation models on the task of filler particle prediction in interpreting.
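Surprisal is conventionally defined as the negative log probability of a unit given its context. When a model such as GPT-2 splits a word into several subword tokens, one common way to obtain a word-level value is to sum the surprisals of its subword tokens. The sketch below illustrates only that standard definition; the function names and probability values are hypothetical, and it is not taken from the corpus release, whose exact procedure is not described here.

```python
import math


def token_surprisal(prob: float) -> float:
    """Return the surprisal, in bits, of a token with conditional probability `prob`."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in the interval (0, 1]")
    return -math.log2(prob)


def word_surprisal(subword_probs) -> float:
    """Sum subword surprisals to obtain the surprisal of the whole word.

    By the chain rule, the probability of a word is the product of the
    conditional probabilities of its subword tokens, so the log terms add.
    """
    return sum(token_surprisal(p) for p in subword_probs)


# Hypothetical conditional probabilities for one word split into two subword tokens.
example_probs = [0.5, 0.25]
print(word_surprisal(example_probs))  # 1 bit + 2 bits = 3.0
```

In practice the probabilities would come from a language model or machine translation model conditioned on the preceding context (and, for translation models, on the source segment).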