Scaling Laws for Downstream Task Performance in Machine Translation

📅 2024-02-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates how the size and distribution of pretraining data affect the downstream machine translation (MT) performance of large language models (LLMs). Methodologically, the authors quantify the distribution alignment between pretraining and MT data and run a multi-metric empirical analysis, using BLEU, COMET, and downstream cross-entropy, to characterize performance trends. The results show that translation quality improves monotonically with pretraining data size only under sufficient distribution alignment; moderate misalignment can cause translation scores to fluctuate or even degrade with more pretraining; and downstream translation quality often diverges from cross-entropy trends. The authors propose a log-law scaling model that, under sufficient alignment, extrapolates MT quality metrics across pretraining data scales with good accuracy. The work establishes distribution alignment as a central factor in LLM transfer learning, offering practical guidance for data selection and scale planning in MT-oriented LLM adaptation.

📝 Abstract
Scaling laws provide important insights that can guide the design of large language models (LLMs). Existing work has primarily focused on studying scaling laws for pretraining (upstream) loss. However, in transfer learning settings, in which LLMs are pretrained on an unsupervised dataset and then finetuned on a downstream task, we often also care about the downstream performance. In this work, we study the scaling behavior in a transfer learning setting, where LLMs are finetuned for machine translation tasks. Specifically, we investigate how the choice of the pretraining data and its size affect downstream performance (translation quality) as judged by: downstream cross-entropy and translation quality metrics such as BLEU and COMET scores. Our experiments indicate that the size of the finetuning dataset and the distribution alignment between the pretraining and downstream data significantly influence the scaling behavior. With sufficient alignment, both downstream cross-entropy and translation quality scores improve monotonically with more pretraining data. In such cases, we show that it is possible to predict the downstream translation quality metrics with good accuracy using a log-law. However, there are cases where moderate misalignment causes the downstream translation scores to fluctuate or get worse with more pretraining, whereas downstream cross-entropy monotonically improves. By analyzing these, we provide new practical insights for choosing appropriate pretraining data.
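The log-law prediction described in the abstract can be sketched as a simple curve fit. Below is a minimal, illustrative example assuming the paper's functional form, BLEU(D) = (log(A · D^α))^β, where D is the pretraining data size; the data points, the grid-search fitting procedure, and the resulting coefficients are all hypothetical stand-ins, not values or methods from the paper:

```python
import numpy as np

# Assumed functional form of the log-law for translation quality as a
# function of pretraining data size d: f(d) = (log(A * d**alpha))**beta.
def log_law(d, A, alpha, beta):
    return np.log(A * np.asarray(d, dtype=float) ** alpha) ** beta

# Hypothetical BLEU scores measured at small pretraining scales (in tokens);
# these numbers are illustrative, not results from the paper.
d_obs = np.array([1e8, 3e8, 1e9, 3e9])
bleu_obs = np.array([18.2, 21.5, 24.6, 27.1])

# Coarse grid search over (A, alpha, beta) minimizing squared error --
# a simple stand-in for a proper nonlinear least-squares fit.
best, best_err = None, float("inf")
for A in np.logspace(-2, 0, 41):
    for alpha in np.linspace(0.05, 0.5, 25):
        for beta in np.linspace(0.5, 3.0, 26):
            inner = np.log(A * d_obs ** alpha)
            if np.any(inner <= 0):  # keep the fractional power well-defined
                continue
            err = float(np.sum((inner ** beta - bleu_obs) ** 2))
            if err < best_err:
                best, best_err = (A, alpha, beta), err

A, alpha, beta = best
# Extrapolate translation quality to a larger pretraining budget.
print(f"predicted BLEU at 1e10 tokens: {log_law(1e10, A, alpha, beta):.1f}")
```

As the abstract notes, such an extrapolation from small pretraining scales is only reliable when the pretraining and downstream distributions are sufficiently aligned; under moderate misalignment, translation scores may fluctuate or degrade and the fitted law no longer predicts them.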
Problem

Research questions and friction points this paper is trying to address.

Scaling laws for downstream machine translation tasks
Impact of pretraining data size on translation quality
Alignment between pretraining and downstream data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transfer learning with LLMs
Pretraining data size impact
Log-law predicts translation quality