Information Loss in LLMs' Multilingual Translation: The Role of Training Data, Language Proximity, and Language Family

📅 2025-06-29
📈 Citations: 0 · Influential citations: 0
🤖 AI Summary
This study investigates the information loss mechanisms of large language models (LLMs) in multilingual translation, focusing on the interplay among training data scale, linguistic distance, and genealogical relatedness. Using back-translation experiments with GPT-4 and Llama 2, evaluated via BLEU and BERTScore, we identify three key findings: (1) languages closer to English in linguistic distance exhibit superior information retention under low-resource conditions; (2) sufficient training data substantially mitigates information loss attributable to structural divergence; and (3) orthographic, genealogical, syntactic, and geographic distances all serve as strong predictors of translation performance, with language family membership providing independent explanatory power. Crucially, this work provides the first systematic empirical validation of strong interaction effects between multidimensional linguistic distances and data scale. The results yield interpretable, quantitative insights, grounded in linguistically informed metrics, for optimizing translation systems, particularly for low-resource languages.
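
The round-trip protocol the summary describes is straightforward to reproduce in outline. The following is a minimal sketch, assuming the OpenAI Python client and the sacrebleu and bert-score packages; the prompt wording, pivot language, and sample sentence are illustrative stand-ins, not the authors' exact setup.

```python
# Minimal round-trip (back-translation) sketch: English -> pivot -> English,
# scored with BLEU (surface overlap) and BERTScore (semantic similarity).
from openai import OpenAI
import sacrebleu
from bert_score import score as bert_score

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def translate(text: str, source: str, target: str, model: str = "gpt-4") -> str:
    """Ask the model to translate text from the source to the target language."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": f"Translate the following {source} text into {target}. "
                       f"Return only the translation.\n\n{text}",
        }],
    )
    return response.choices[0].message.content.strip()

original = "The committee postponed the vote until next week."  # illustrative
pivot = translate(original, "English", "German")       # outbound leg
round_trip = translate(pivot, "German", "English")     # return leg

# Compare the round-trip output against the original English sentence.
bleu = sacrebleu.corpus_bleu([round_trip], [[original]])
_, _, f1 = bert_score([round_trip], [original], lang="en")

print(f"BLEU: {bleu.score:.1f}  BERTScore F1: {f1.item():.3f}")
```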

📝 Abstract
Large language models have achieved impressive progress in multilingual translation, yet they continue to face challenges with certain language pairs, particularly those with limited training data or significant linguistic divergence from English. This study systematically investigates how training data, language proximity, and language family affect information loss in multilingual translation. We evaluate two large language models, GPT-4 and Llama 2, by performing round-trip translations, assessing translation quality with BLEU scores and BERT similarity metrics. Our results reveal a robust interaction between training data size and language distance: while abundant training data can mitigate the effects of linguistic divergence, languages structurally closer to English consistently yield higher translation quality in low-resource conditions. Among various distance metrics, orthographic, phylogenetic, syntactic, and geographical distances emerge as strong predictors of translation performance. Language family also exerts an independent influence. These findings contribute to a deeper understanding of the linguistic constraints shaping multilingual translation in large language models, emphasizing that translation quality is shaped not only by data volume but also by structural and typological relationships between languages.
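
The reported interaction between data scale and linguistic distance corresponds to a standard moderation test. Below is a hedged sketch using statsmodels; the input file and the column names (bleu, log_data_size, distance) are hypothetical placeholders, not the paper's released data.

```python
# Hypothetical per-language results: one row per language, with a round-trip
# BLEU score, (log) training-data size, and a linguistic distance from English.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("roundtrip_scores.csv")  # placeholder file name

# OLS with main effects plus their interaction. Under the paper's finding,
# distance should carry a negative coefficient, and a positive interaction
# term would indicate that the distance penalty shrinks as training data grows.
model = smf.ols("bleu ~ log_data_size * distance", data=df).fit()
print(model.summary())
```
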
Problem

Research questions and friction points this paper is trying to address.

Investigates information loss in multilingual translation by LLMs
Examines impact of training data, language proximity, and family
Assesses translation quality using BLEU and BERT metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates GPT-4 and Llama 2 via round-trip translations
Uses BLEU scores and BERT similarity for quality assessment
Analyzes orthographic, phylogenetic, syntactic, geographical distances
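
The typological distances named above can be approximated with off-the-shelf resources. A minimal sketch follows, assuming the lang2vec package (a front end to the URIEL typological database); lang2vec exposes genetic (roughly the paper's phylogenetic), syntactic, and geographic distances, but no orthographic distance, so that metric would need a separate source. The language set is illustrative.

```python
# Query typological distances from English via lang2vec (URIEL).
# Language codes are ISO 639-3; the selection is illustrative.
import lang2vec.lang2vec as l2v

languages = {"deu": "German", "cmn": "Mandarin", "fin": "Finnish"}

for code, name in languages.items():
    # Each call returns a scalar distance in [0, 1] between the two languages.
    genetic = l2v.distance("genetic", "eng", code)
    syntactic = l2v.distance("syntactic", "eng", code)
    geographic = l2v.distance("geographic", "eng", code)
    print(f"{name}: genetic={genetic:.2f}, "
          f"syntactic={syntactic:.2f}, geographic={geographic:.2f}")
```
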
👥 Authors

Yumeng Lin
Department of Linguistics and Modern Languages, The Chinese University of Hong Kong, Hong Kong SAR, China

Xufeng Duan
Department of Linguistics and Modern Languages, The Chinese University of Hong Kong, Hong Kong SAR, China

David Haslett
Division of Social Science, The Hong Kong University of Science and Technology, Hong Kong SAR, China

Yige Chen
College of Computer Science and Artificial Intelligence, Wenzhou University
Networking

Zhenguang G. Cai
Professor, The Chinese University of Hong Kong
Psycholinguistics, explainable AI, psychophysics