🤖 AI Summary
This study addresses a critical gap in knowledge distillation for machine translation by incorporating environmental impact alongside translation quality. While existing work primarily focuses on performance, it largely overlooks computational costs and associated carbon emissions, hindering informed method selection under resource constraints. To bridge this gap, the paper introduces machine learning life cycle assessment (MLCA) into distillation decision-making, proposing a holistic model that quantifies emissions across the entire lifecycle—including teacher training, distillation, and inference—accounting for both energy consumption and hardware manufacturing. The analysis reveals that distillation dominates emissions in small-scale deployments, whereas inference becomes predominant at scale. Furthermore, word-level distillation consistently achieves a better trade-off between translation quality and carbon footprint than sequence-level approaches, leading to a reproducible protocol for environmentally conscious distillation method selection.
📝 Abstract
Knowledge distillation (KD) is a tool to compress a larger system (teacher) into a smaller one (student). In machine translation, studies typically report only the translation quality of the student and omit the computational cost of performing KD, making it difficult to select among the many available KD choices under compute constraints. In this study, we evaluate representative KD methods by considering both translation quality and computational cost. We express computational cost as a carbon footprint using the machine learning life cycle assessment (MLCA) tool. This assessment accounts for runtime operational emissions and amortized hardware production costs throughout the KD model life cycle (teacher training, distillation, and inference). We find that (i) distillation overhead dominates the total footprint at small deployment volumes, (ii) inference dominates at scale, making KD beneficial only beyond a task-dependent usage threshold, and (iii) word-level distillation typically offers more favorable footprint-quality trade-offs than sequence-level distillation. Our protocol provides reproducible guidance for selecting KD methods under explicit quality and compute constraints.
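The break-even logic behind findings (i) and (ii) can be sketched numerically: distillation adds a one-off emission cost, which pays off only once the student's lower per-inference footprint has been exercised enough times. The sketch below is illustrative only; the function names and all numbers are assumptions, not values or code from the paper.

```python
# Hypothetical sketch of the lifecycle carbon model sketched in the abstract:
# one-off training/distillation emissions plus usage-scaled inference emissions.
# All figures are made-up illustrations, not measurements from the study.

def lifecycle_emissions(train_kg, distill_kg, per_inference_kg, n_inferences):
    """Total footprint (kg CO2e): fixed training + distillation overhead,
    plus inference emissions that grow with deployment volume."""
    return train_kg + distill_kg + per_inference_kg * n_inferences

def breakeven_inferences(distill_kg, teacher_per_inf_kg, student_per_inf_kg):
    """Smallest deployment volume at which serving the distilled student
    emits less in total than serving the teacher directly.
    Teacher training is paid in both scenarios, so only the distillation
    overhead and the per-inference savings matter."""
    saved_per_inf = teacher_per_inf_kg - student_per_inf_kg
    if saved_per_inf <= 0:
        return None  # student never amortizes the distillation overhead
    return distill_kg / saved_per_inf

# Illustrative numbers: 50 kg CO2e distillation overhead, teacher emits
# 2e-4 kg per request, student 5e-5 kg per request.
threshold = breakeven_inferences(50.0, 2e-4, 5e-5)
print(f"KD pays off beyond ~{threshold:,.0f} inferences")
```

Under these made-up numbers the student amortizes its overhead after a few hundred thousand requests; the paper's point is that this threshold is task-dependent, so the same KD method can be carbon-positive for a high-traffic service and carbon-negative for a one-off deployment.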