Is Textual Similarity Invariant under Machine Translation? Evidence Based on the Political Manifesto Corpus

📅 2026-05-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates whether machine translation preserves semantic similarity in multilingual political manifestos. Working with manifestos in 28 languages, the authors propose a non-inferiority testing framework that assesses translation fidelity by checking whether pairwise similarity relationships between paragraph embeddings remain stable across embedding models, without requiring direct measurement of semantic shift. The approach generalizes to other corpora and downstream tasks. Using translations generated by the EU eTranslation service and multiple embedding models, they use inter-model disagreement in cosine similarity on the original-language text to calibrate the invariance threshold. The results identify ten languages exhibiting translation invariance and four showing detectable semantic distortion, with insufficient evidence to reach a verdict for the remaining languages.

📝 Abstract

We investigate the extent to which cosine similarity between paragraph embeddings is invariant under machine translation, using the Manifesto Corpus of over 2,800 political party platforms in 28 languages, translated into English via the EU eTranslation service. Rather than measuring translation-induced semantic shift directly, we measure the stability of pairwise similarity relationships across embedding models, and use inter-model disagreement on original-language text as a calibrated invariance threshold. This yields a per-language non-inferiority test for four hypotheses about how translation interacts with embedding choice. Its verdicts distinguish three cases: languages where translation demonstrably preserves semantic structure, languages where it demonstrably degrades it, and languages where the available evidence does not resolve the question. The framework is corpus- and pipeline-agnostic and extends naturally to downstream tasks. Applied to our data, it identifies ten languages with translation invariance and four with detectable distortion.
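The calibration idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the use of mean absolute difference as the disagreement measure, the percentile bootstrap, and the three verdict labels are all assumptions made for illustration. The sketch compares how much pairwise cosine similarities change under translation (same model, original vs. translated text) against how much they differ between two models on the original text, and returns a three-way verdict.

```python
import numpy as np

def cosine_similarity_matrix(emb):
    """Pairwise cosine similarity between the rows of an (n, d) embedding array."""
    emb = np.asarray(emb, dtype=float)
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return unit @ unit.T

def pairwise_disagreement(emb_a, emb_b):
    """Absolute similarity differences over all unique paragraph pairs (i < j).

    emb_a and emb_b must embed the same n paragraphs in the same order;
    their embedding dimensions may differ.
    """
    sim_a = cosine_similarity_matrix(emb_a)
    sim_b = cosine_similarity_matrix(emb_b)
    iu = np.triu_indices(sim_a.shape[0], k=1)
    return np.abs(sim_a[iu] - sim_b[iu])

def verdict(translation_gap, model_gap, n_boot=2000, alpha=0.05, seed=0):
    """Hypothetical three-way verdict for one language.

    Builds a percentile bootstrap interval for
    mean(translation_gap) - mean(model_gap), resampling paragraph pairs.
    Assumption: pairs are treated as exchangeable, which ignores the
    dependence between pairs that share a paragraph.
    """
    translation_gap = np.asarray(translation_gap, dtype=float)
    model_gap = np.asarray(model_gap, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(translation_gap)
    idx = rng.integers(0, n, size=(n_boot, n))  # same resample for both gaps
    diffs = translation_gap[idx].mean(axis=1) - model_gap[idx].mean(axis=1)
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    if hi < 0:
        return "invariant"      # translation moves similarities less than model choice does
    if lo > 0:
        return "distorted"      # translation moves similarities more than model choice does
    return "inconclusive"
```

In this sketch, `model_gap` would come from `pairwise_disagreement` applied to two models' embeddings of the original-language paragraphs, and `translation_gap` from one model's embeddings of the original and translated paragraphs. The paper's actual test statistics and its four hypotheses may be defined differently.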

Problem

Research questions and friction points this paper is trying to address.

textual similarity

machine translation

semantic invariance

embedding models

translation distortion

Innovation

Methods, ideas, or system contributions that make the work stand out.

translation invariance

cosine similarity

embedding stability