Information Loss in LLMs' Multilingual Translation: The Role of Training Data, Language Proximity, and Language Family

📅 2025-06-29
📈 Citations: 0 · Influential citations: 0
🤖 AI Summary
This study investigates the information loss mechanisms of large language models (LLMs) in multilingual translation, focusing on the interplay among training data scale, linguistic distance, and genealogical relatedness. Using back-translation experiments with GPT-4 and Llama 2, evaluated via BLEU and BERTScore, we identify three key findings: (1) languages closer to English in linguistic distance exhibit superior information retention under low-resource conditions; (2) sufficient training data substantially mitigates information loss attributable to structural divergence; and (3) orthographic, genealogical, syntactic, and geographic distances all serve as strong predictors of translation performance, with language family membership providing independent explanatory power. Crucially, this work provides the first systematic empirical validation of strong interaction effects between multidimensional linguistic distances and data scale. The results yield interpretable, quantitative insights, grounded in linguistically informed metrics, for optimizing translation systems, particularly for low-resource languages.
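
The round-trip protocol the summary describes is straightforward to reproduce in outline. The following is a minimal sketch, assuming the OpenAI Python client and the sacrebleu and bert-score packages; the prompt wording, pivot language, and sample sentence are illustrative stand-ins, not the authors' exact setup.

```python
# Minimal round-trip (back-translation) sketch: English -> pivot -> English,
# scored with BLEU (surface overlap) and BERTScore (semantic similarity).
from openai import OpenAI
import sacrebleu
from bert_score import score as bert_score

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def translate(text: str, source: str, target: str, model: str = "gpt-4") -> str:
    """Ask the model to translate text from the source to the target language."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": f"Translate the following {source} text into {target}. "
                       f"Return only the translation.\n\n{text}",
        }],
    )
    return response.choices[0].message.content.strip()

original = "The committee postponed the vote until next week."  # illustrative
pivot = translate(original, "English", "German")       # outbound leg
round_trip = translate(pivot, "German", "English")     # return leg

# Compare the round-trip output against the original English sentence.
bleu = sacrebleu.corpus_bleu([round_trip], [[original]])
_, _, f1 = bert_score([round_trip], [original], lang="en")

print(f"BLEU: {bleu.score:.1f}  BERTScore F1: {f1.item():.3f}")
```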

📝 Abstract
Large language models have achieved impressive progress in multilingual translation, yet they continue to face challenges with certain language pairs, particularly those with limited training data or significant linguistic divergence from English. This study systematically investigates how training data, language proximity, and language family affect information loss in multilingual translation. We evaluate two large language models, GPT-4 and Llama 2, by performing round-trip translations, assessing translation quality with BLEU scores and BERT similarity metrics. Our results reveal a robust interaction between training data size and language distance: while abundant training data can mitigate the effects of linguistic divergence, languages structurally closer to English consistently yield higher translation quality in low-resource conditions. Among various distance metrics, orthographic, phylogenetic, syntactic, and geographical distances emerge as strong predictors of translation performance. Language family also exerts an independent influence. These findings contribute to a deeper understanding of the linguistic constraints shaping multilingual translation in large language models, emphasizing that translation quality is shaped not only by data volume but also by structural and typological relationships between languages.
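
The reported interaction between data scale and linguistic distance corresponds to a standard moderation test. Below is a hedged sketch using statsmodels; the input file and the column names (bleu, log_data_size, distance) are hypothetical placeholders, not the paper's released data.

```python
# Hypothetical per-language results: one row per language, with a round-trip
# BLEU score, (log) training-data size, and a linguistic distance from English.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("roundtrip_scores.csv")  # placeholder file name

# OLS with main effects plus their interaction. Under the paper's finding,
# distance should carry a negative coefficient, and a positive interaction
# term would indicate that the distance penalty shrinks as training data grows.
model = smf.ols("bleu ~ log_data_size * distance", data=df).fit()
print(model.summary())
```
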
Problem

Research questions and friction points this paper is trying to address.

Investigates information loss in multilingual translation by LLMs
Examines impact of training data, language proximity, and family
Assesses translation quality using BLEU and BERT metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates GPT-4 and Llama 2 via round-trip translations
Uses BLEU scores and BERT similarity for quality assessment
Analyzes orthographic, phylogenetic, syntactic, geographical distances
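
The typological distances named above can be approximated with off-the-shelf resources. A minimal sketch follows, assuming the lang2vec package (a front end to the URIEL typological database); lang2vec exposes genetic (roughly the paper's phylogenetic), syntactic, and geographic distances, but no orthographic distance, so that metric would need a separate source. The language set is illustrative.

```python
# Query typological distances from English via lang2vec (URIEL).
# Language codes are ISO 639-3; the selection is illustrative.
import lang2vec.lang2vec as l2v

languages = {"deu": "German", "cmn": "Mandarin", "fin": "Finnish"}

for code, name in languages.items():
    # Each call returns a scalar distance in [0, 1] between the two languages.
    genetic = l2v.distance("genetic", "eng", code)
    syntactic = l2v.distance("syntactic", "eng", code)
    geographic = l2v.distance("geographic", "eng", code)
    print(f"{name}: genetic={genetic:.2f}, "
          f"syntactic={syntactic:.2f}, geographic={geographic:.2f}")
```
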
👥 Authors

Yumeng Lin
Department of Linguistics and Modern Languages, The Chinese University of Hong Kong, Hong Kong SAR, China

Xufeng Duan
Department of Linguistics and Modern Languages, The Chinese University of Hong Kong, Hong Kong SAR, China

David Haslett
Division of Social Science, The Hong Kong University of Science and Technology, Hong Kong SAR, China

Yige Chen
College of Computer Science and Artificial Intelligence, Wenzhou University
Networking

Zhenguang G. Cai
Professor, The Chinese University of Hong Kong
Psycholinguistics, explainable AI, psychophysics