Overestimation in LLM Evaluation: A Controlled Large-Scale Study on Data Contamination's Impact on Machine Translation

📅 2025-01-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work exposes the severe distortion that data contamination (the inadvertent inclusion of test samples in training data) causes in machine translation (MT) evaluation. The authors systematically inject contamination at the source, target, and joint levels into 1B- and 8B-parameter LLMs, starting from rigorously decontaminated baselines and applying multi-granularity contamination control. The analysis reveals a scale-dependent amplification effect: under joint source+target contamination, BLEU scores for the 8B model are artificially inflated 2.5 times more than those for the 1B model, with spurious gains of up to 30 BLEU points, whereas source-only and target-only contamination produce smaller, less consistent over-estimations. Further analysis shows that the frequency and temporal distribution of contaminated samples bias multilingual evaluation differently, with low-resource languages exhibiting heightened sensitivity, and that even minimal contamination causes significant, inconsistent distortion. The study establishes a contamination-attribution framework for trustworthy MT evaluation and offers empirically grounded, scale-aware calibration principles.

📝 Abstract
Data contamination -- the accidental consumption of evaluation examples within the pre-training data -- can undermine the validity of evaluation benchmarks. In this paper, we present a rigorous analysis of the effects of contamination on language models at 1B and 8B scales on the machine translation task. Starting from a carefully decontaminated train-test split, we systematically introduce contamination at various stages, scales, and data formats to isolate its effect and measure its impact on performance metrics. Our experiments reveal that contamination with both source and target substantially inflates BLEU scores, and this inflation is 2.5 times larger (up to 30 BLEU points) for 8B compared to 1B models. In contrast, source-only and target-only contamination generally produce smaller, less consistent over-estimations. Finally, we study how the temporal distribution and frequency of contaminated samples influence performance over-estimation across languages with varying degrees of data resources.
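The "carefully decontaminated train-test split" the abstract describes implies screening test examples for overlap with the pre-training corpus. A minimal sketch of one common screening approach, n-gram overlap detection (the function names, whitespace tokenization, and the n=8 threshold here are illustrative assumptions, not the paper's actual procedure):

```python
def ngrams(tokens, n):
    """Return the set of all n-grams (as tuples) in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(train_text, test_text, n=8):
    """Flag a test example whose n-gram set overlaps the training text.
    Uses simple lowercased whitespace tokenization; production pipelines
    typically also normalize punctuation and deduplicate near-matches."""
    train_ng = ngrams(train_text.lower().split(), n)
    test_ng = ngrams(test_text.lower().split(), n)
    return bool(train_ng & test_ng)
```

A long shared n-gram is strong evidence the test sentence was seen during training; short accidental overlaps are common in natural text, which is why n is kept fairly large.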
Problem

Research questions and friction points this paper is trying to address.

Data Contamination
Machine Translation Accuracy
BLEU Score Inflation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Data Pollution
Language Models
Machine Translation Performance