Multilingual Non-Autoregressive Machine Translation without Knowledge Distillation

πŸ“… 2025-02-06
πŸ›οΈ International Joint Conference on Natural Language Processing
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing non-autoregressive multilingual neural machine translation (MNMT) heavily relies on computationally expensive knowledge distillation (KD) to achieve competitive performance, hindering efficiency and scalability. This paper proposes M-DAT, the first KD-free framework for non-autoregressive multilingual translation. Built upon the Directed Acyclic Transformer (DAT) architecture, M-DAT integrates multilingual joint training with a novel pivot back-translation (PivotBT) strategy to explicitly model latent cross-lingual alignments, thereby substantially improving zero-shot generalization to unseen language directions. Evaluated on standard multilingual benchmarks, M-DAT achieves state-of-the-art performance among non-autoregressive models: it attains a 3.2Γ— speedup over autoregressive baselines while incurring only a marginal BLEU degradation of 0.4–0.8 points. Thus, M-DAT bridges the longstanding trade-off between inference efficiency and translation accuracy in multilingual NMT, enabling scalable, high-fidelity non-autoregressive translation without KD.

πŸ“ Abstract
Multilingual neural machine translation (MNMT) aims at using one single model for multiple translation directions. Recent work applies non-autoregressive Transformers to improve the efficiency of MNMT, but requires expensive knowledge distillation (KD) processes. To this end, we propose an M-DAT approach to non-autoregressive multilingual machine translation. Our system leverages the recent advance of the directed acyclic Transformer (DAT), which does not require KD. We further propose a pivot back-translation (PivotBT) approach to improve the generalization to unseen translation directions. Experiments show that our M-DAT achieves state-of-the-art performance in non-autoregressive MNMT.
Problem

Research questions and friction points this paper is trying to address.

Eliminate knowledge distillation in multilingual machine translation.
Enhance efficiency using non-autoregressive Transformers.
Improve generalization to unseen translation directions.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Non-autoregressive Transformers improve MNMT efficiency
Directed acyclic Transformer avoids knowledge distillation
Pivot back-translation enhances generalization capabilities
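The pivot back-translation idea can be illustrated with a minimal sketch. This is not the authors' implementation: the `model.translate` interface, the language codes, and the choice of pivot set are all assumptions made for illustration. The core idea is that for an unseen direction, a pseudo source sentence is synthesized by routing the target sentence through a pivot language whose directions were seen in training.

```python
import random

def pivot_back_translation(model, target_sentence, src_lang, tgt_lang, pivot_langs):
    """Hedged sketch of pivot back-translation (PivotBT).

    For a translation direction (src_lang -> tgt_lang) unseen in training,
    synthesize a pseudo source sentence by translating the target sentence
    into a randomly sampled pivot language (covered by supervised data),
    then from the pivot into src_lang. The resulting
    (pseudo_source, target_sentence) pair serves as extra training data.
    `model.translate` is a hypothetical interface, not the paper's API.
    """
    pivot = random.choice(pivot_langs)
    # tgt -> pivot: a direction seen in supervised training
    pivot_sentence = model.translate(target_sentence, src=tgt_lang, tgt=pivot)
    # pivot -> src: another supervised direction
    pseudo_source = model.translate(pivot_sentence, src=pivot, tgt=src_lang)
    return pseudo_source, target_sentence
```

In a multilingual setting the pivot is typically a high-resource language (often English) paired with both endpoints in the training data, which is what makes the two hops supervised even when the direct pair is not.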
Chenyang Huang
Ph.D. Student, University of Alberta
ML, DL, NLP, CV
Fei Huang
Damo Academy, Alibaba
Zaixiang Zheng
ByteDance Seed
ML, NLP, AI for Science
Osmar R. ZaΓ―ane
Dept. of Computing Science, Alberta Machine Intelligence Institute (Amii), University of Alberta
Hao Zhou
Institute for AI Industry Research (AIR), Tsinghua University
Lili Mou
University of Alberta
Natural Language Processing, Machine Learning