Is LLM the Silver Bullet to Low-Resource Languages Machine Translation?

📅 2025-03-31

📈 Citations: 0

✨ Influential: 0

career value

161K/year

🤖 AI Summary

This study investigates the cross-lingual generalization capability and practical bottlenecks of large language models (LLMs) in machine translation for low-resource languages (LRLs). Addressing challenges of data scarcity, privacy sensitivity, and computational constraints, we propose a lightweight collaborative optimization paradigm: integrating heterogeneous surrogate data—including news corpora and bilingual dictionaries—with knowledge distillation and progressive parameter-efficient fine-tuning. To our knowledge, this is the first systematic evaluation of LLM-based translation across 200 LRLs on the FLORES-200 benchmark, empirically uncovering critical limitations in zero-shot and few-shot cross-lingual transfer. Experimental results demonstrate substantial improvements in translation quality for small-scale LLMs on extremely low-resource languages, achieving an average BLEU gain of +12.4 across 37 languages. The findings validate that performance gains need not rely solely on model scaling, offering an effective, resource-efficient alternative for LRL translation.

Technology Category

Application Category

📝 Abstract

Low-Resource Languages (LRLs) present significant challenges in natural language processing due to their limited linguistic resources and underrepresentation in standard datasets. While recent advancements in Large Language Models (LLMs) and Neural Machine Translation (NMT) have substantially improved translation capabilities for high-resource languages, performance disparities persist for LRLs, particularly impacting privacy-sensitive and resource-constrained scenarios. This paper systematically evaluates the limitations of current LLMs across 200 languages using benchmarks such as FLORES-200. We also explore alternative data sources, including news articles and bilingual dictionaries, and demonstrate how knowledge distillation from large pre-trained models can significantly improve smaller LRL translations. Additionally, we investigate various fine-tuning strategies, revealing that incremental enhancements markedly reduce performance gaps on smaller LLMs.

Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM limitations for 200 low-resource languages

Exploring alternative data sources for LRL translation

Improving small LRL models via knowledge distillation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates LLMs across 200 languages using benchmarks

Uses news articles and bilingual dictionaries as data

Applies knowledge distillation to enhance small LRL models

🔎 Similar Papers

No similar papers found.