Is Textual Similarity Invariant under Machine Translation? Evidence Based on the Political Manifesto Corpus

📅 2026-05-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether machine translation preserves semantic similarity in multilingual political manifestos. Working with manifestos in 28 languages, we propose a non-inferiority testing framework that assesses translation fidelity by checking whether pairwise similarity relationships between paragraph embeddings stay consistent across embedding models, without measuring semantic shift directly. Translations come from the EU eTranslation service, and inter-model disagreement in cosine similarity on the original-language text calibrates the invariance threshold. The approach generalizes to other corpora and downstream tasks. Our results identify ten languages exhibiting translation invariance and four showing detectable distortion; for the remaining languages the evidence is insufficient.
📝 Abstract
We investigate the extent to which cosine similarity between paragraph embeddings is invariant under machine translation, using the Manifesto Corpus of over 2,800 political party platforms in 28 languages translated to English via the EU eTranslation service. Rather than measuring translation-induced semantic shift directly, we measure the stability of pairwise similarity relationships across embedding models, and use inter-model disagreement on original-language text as a calibrated invariance threshold. This yields a per-language non-inferiority test for four hypotheses about how translation interacts with embedding choice, with verdicts that distinguish languages where translation demonstrably preserves semantic structure from those where it demonstrably degrades it and from those where the available evidence does not resolve the question. The framework is corpus- and pipeline-agnostic and extends naturally to downstream tasks. Applied to our data, it identifies ten languages with translation invariance and four with detectable distortion.
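The abstract does not spell out the test statistic, so the following is only a minimal sketch of the general idea, not the authors' procedure. It assumes that disagreement between two similarity matrices is measured as the mean absolute difference over paragraph pairs, that the margin is the disagreement between two embedding models on the original text, and that uncertainty comes from a paragraph-level bootstrap. The function names, the choice of statistic, and the bootstrap are all hypothetical.

```python
import numpy as np

def cosine_matrix(emb):
    """Pairwise cosine similarities between the rows of an (n, d) array."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return unit @ unit.T

def disagreement(sim_a, sim_b, idx):
    """Mean absolute difference of two similarity matrices on resampled pairs."""
    a = sim_a[np.ix_(idx, idx)]
    b = sim_b[np.ix_(idx, idx)]
    rows, cols = np.triu_indices(len(idx), k=1)
    keep = idx[rows] != idx[cols]  # skip pairs of a paragraph with itself
    return float(np.mean(np.abs(a[rows, cols][keep] - b[rows, cols][keep])))

def invariance_verdict(orig_a, orig_b, trans_a, n_boot=500, alpha=0.05, seed=0):
    """Three-way verdict for one language (assumed statistic, see lead-in).

    orig_a, orig_b: original-language paragraphs embedded by models A and B.
    trans_a: the translated paragraphs embedded by model A.
    """
    s_orig_a = cosine_matrix(orig_a)
    s_orig_b = cosine_matrix(orig_b)
    s_trans_a = cosine_matrix(trans_a)
    n = s_orig_a.shape[0]
    rng = np.random.default_rng(seed)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample paragraphs with replacement
        effect = disagreement(s_orig_a, s_trans_a, idx)  # translation effect
        margin = disagreement(s_orig_a, s_orig_b, idx)   # inter-model disagreement
        diffs[i] = effect - margin
    lower, upper = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    if upper < 0:
        return "invariant"     # translation moves similarities less than the margin
    if lower > 0:
        return "distorted"     # translation moves similarities more than the margin
    return "inconclusive"      # interval straddles the margin
```

Calling `invariance_verdict` once per language with that language's paragraph embeddings would give the kind of three-way split the abstract describes: invariant, distorted, or unresolved. The four hypotheses mentioned above would each need their own comparison of matrices, which this sketch does not attempt to reproduce.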
Problem

Research questions and friction points this paper is trying to address.

textual similarity
machine translation
semantic invariance
embedding models
translation distortion
Innovation

Methods, ideas, or system contributions that make the work stand out.

translation invariance
cosine similarity
embedding stability
non-inferiority test
multilingual semantic preservation
Daria Boratyn
Jagiellonian Center for Quantitative Political Science, Jagiellonian University, Kraków, Poland
Damian Brzyski
Jagiellonian Center for Quantitative Political Science, Jagiellonian University, Kraków, Poland
Albert Leśniak
Jagiellonian Center for Quantitative Political Science, Jagiellonian University, Kraków, Poland
Wojciech Łukasik
Jagiellonian Center for Quantitative Political Science, Jagiellonian University, Kraków, Poland
Maciej Rapacz
AGH University, Kraków, Poland
Jan Rybicki
Jagiellonian University, Kraków, Poland
stylometry, authorship attribution, translation studies, comparative literature, digital humanities
Wojciech Słomczyński
Jagiellonian Center for Quantitative Political Science, Jagiellonian University, Kraków, Poland
Dariusz Stolicki
Jagiellonian Center for Quantitative Political Science, Jagiellonian University
electoral studies, social choice, gerrymandering, American constitutional law, legislative studies