🤖 AI Summary
This study addresses a critical gap in knowledge distillation for machine translation by incorporating environmental impact alongside translation quality. While existing work primarily focuses on performance, it largely overlooks computational costs and associated carbon emissions, hindering informed method selection under resource constraints. To bridge this gap, the paper introduces machine learning life cycle assessment (MLCA) into distillation decision-making, proposing a holistic model that quantifies emissions across the entire lifecycle—including teacher training, distillation, and inference—accounting for both energy consumption and hardware manufacturing. The analysis reveals that distillation dominates emissions in small-scale deployments, whereas inference becomes predominant at scale. Furthermore, word-level distillation consistently achieves a better trade-off between translation quality and carbon footprint than sequence-level approaches, leading to a reproducible protocol for environmentally conscious distillation method selection.
📝 Abstract
Knowledge distillation (KD) is a tool to compress a larger system (teacher) into a smaller one (student). In machine translation, studies typically report only the translation quality of the student and omit the computational cost of performing KD, making it difficult to select among the many available KD choices under compute constraints. In this study, we evaluate representative KD methods by considering both translation quality and computational cost. We express computational cost as a carbon footprint using the machine learning life cycle assessment (MLCA) tool. This assessment accounts for runtime operational emissions and amortized hardware production costs throughout the KD model life cycle (teacher training, distillation, and inference). We find that (i) distillation overhead dominates the total footprint at small deployment volumes, (ii) inference dominates at scale, making KD beneficial only beyond a task-dependent usage threshold, and (iii) word-level distillation typically offers more favorable footprint-quality trade-offs than sequence-level distillation. Our protocol provides reproducible guidance for selecting KD methods under explicit quality and compute constraints.
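The break-even logic behind findings (i) and (ii) can be sketched numerically: distillation adds a one-off emission cost, which pays off only once the student's lower per-inference footprint has been exercised enough times. The sketch below is illustrative only; the function names and all numbers are assumptions, not values or code from the paper.

```python
# Hypothetical sketch of the lifecycle carbon model sketched in the abstract:
# one-off training/distillation emissions plus usage-scaled inference emissions.
# All figures are made-up illustrations, not measurements from the study.

def lifecycle_emissions(train_kg, distill_kg, per_inference_kg, n_inferences):
    """Total footprint (kg CO2e): fixed training + distillation overhead,
    plus inference emissions that grow with deployment volume."""
    return train_kg + distill_kg + per_inference_kg * n_inferences

def breakeven_inferences(distill_kg, teacher_per_inf_kg, student_per_inf_kg):
    """Smallest deployment volume at which serving the distilled student
    emits less in total than serving the teacher directly.
    Teacher training is paid in both scenarios, so only the distillation
    overhead and the per-inference savings matter."""
    saved_per_inf = teacher_per_inf_kg - student_per_inf_kg
    if saved_per_inf <= 0:
        return None  # student never amortizes the distillation overhead
    return distill_kg / saved_per_inf

# Illustrative numbers: 50 kg CO2e distillation overhead, teacher emits
# 2e-4 kg per request, student 5e-5 kg per request.
threshold = breakeven_inferences(50.0, 2e-4, 5e-5)
print(f"KD pays off beyond ~{threshold:,.0f} inferences")
```

Under these made-up numbers the student amortizes its overhead after a few hundred thousand requests; the paper's point is that this threshold is task-dependent, so the same KD method can be carbon-positive for a high-traffic service and carbon-negative for a one-off deployment.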