Memorization Inheritance in Sequence-Level Knowledge Distillation for Neural Machine Translation

📅 2025-02-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper identifies an over-inheritance mechanism in sequence-level knowledge distillation (SeqKD): student models excessively absorb their teacher models' instance-level memorization and, as a result, hallucinate more, i.e., generate content absent from the source text. The effect is most pronounced on low-quality and counterfactually memorized (CM) training subgroups, where the paper also observes amplified denoising. This is the first systematic empirical evidence of a strong coupling between memorization inheritance and increased hallucination. Experiments show that, relative to baseline models of the same size trained on the original data, SeqKD students exhibit a 3.4% increase in exact-match memorization and a 57% surge in extractive memorization, accompanied by elevated hallucination rates. To address this, the authors propose Adaptive-SeqKD, which dynamically adjusts distillation weights to suppress memorization leakage and hallucination; it maintains translation quality (BLEU) while significantly reducing both memorization rates and the hallucination rate.

📝 Abstract
In this work, we explore how instance-level memorization in the teacher Neural Machine Translation (NMT) model gets inherited by the student model in sequence-level knowledge distillation (SeqKD). We find that despite not directly seeing the original training data, students memorize more than baseline models (models of the same size, trained on the original data) -- 3.4% for exact matches and 57% for extractive memorization -- and show increased hallucination rates. Further, under this SeqKD setting, we also characterize how students behave on specific training data subgroups, such as subgroups with low quality and specific counterfactual memorization (CM) scores, and find that students exhibit amplified denoising on low-quality subgroups. Finally, we propose a modification to SeqKD named Adaptive-SeqKD, which intervenes in SeqKD to reduce memorization and hallucinations. Overall, we recommend caution when applying SeqKD: students inherit both their teachers' superior performance and their fault modes, thereby requiring active monitoring.
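The summary describes Adaptive-SeqKD only at a high level: it intervenes in distillation by adjusting per-example weights to suppress memorization leakage. A minimal sketch of that idea, assuming a per-example quality or memorization-risk score is available (the scoring signal, threshold, and down-weighting factor below are illustrative assumptions, not the paper's published implementation):

```python
# Hedged sketch of an Adaptive-SeqKD-style weighting step: instead of
# copying every teacher translation into the student's distilled training
# set at equal weight, down-weight examples whose teacher output looks
# risky (low quality / likely memorized), so the student inherits less
# of the teacher's memorization and hallucination behavior.

def adaptive_seqkd_weights(examples, risk_scores, threshold=0.3, down_weight=0.1):
    """Return a per-example distillation weight.

    examples     -- teacher-generated (source, translation) pairs
    risk_scores  -- higher score = teacher output judged safer / higher quality
                    (e.g. from a QE metric; the scoring model is an assumption)
    threshold    -- examples scoring below this are treated as risky
    down_weight  -- weight applied to risky examples instead of 1.0
    """
    assert len(examples) == len(risk_scores)
    return [down_weight if s < threshold else 1.0 for s in risk_scores]
```

In practice these weights would scale each example's loss term during student training; filtering (weight 0) is the limiting case of the same scheme.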
Problem

Research questions and friction points this paper is trying to address.

Knowledge Distillation
Overfitting
Hallucination
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sequence-level Knowledge Distillation
Adaptive SeqKD
Neural Machine Translation