🤖 AI Summary
To address computational redundancy and poor translation quality for low-resource languages in multilingual machine translation (MT) and speech translation (ST), this paper proposes a hierarchical Transformer encoder tree. Built on linguistic similarity, the architecture shares intermediate representations across languages and generates translations into multiple target languages in a single forward pass, enabling cross-lingual knowledge transfer and parameter-efficient sharing. The authors integrate this encoder tree into a non-autoregressive ST framework that combines a CTC-trained encoder-only Transformer, a wav2vec 2.0 speech encoder, and hierarchical parameter sharing. Experiments show the approach matches autoregressive models in translation quality on multilingual MT and ST benchmarks while accelerating inference by 7–14× and substantially reducing computational cost.
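The core structural idea, a tree of encoder stages in which linguistically similar target languages share intermediate layers and one forward pass yields outputs for every leaf, can be sketched as follows. This is a minimal illustrative toy, not the paper's implementation: the node/class names, the scalar "layers", and the language grouping are all assumptions made for demonstration.

```python
# Hypothetical sketch of a hierarchical encoder tree: shared layers near
# the root, branches for language groups, leaves per target language.
from typing import Callable, Dict, List, Optional

Layer = Callable[[List[float]], List[float]]

class EncoderNode:
    """A tree node: a stack of shared layers plus child subtrees."""
    def __init__(self, layers: List[Layer],
                 children: Optional[Dict[str, "EncoderNode"]] = None,
                 language: Optional[str] = None):
        self.layers = layers
        self.children = children or {}
        self.language = language  # set only on leaf nodes

    def forward(self, x: List[float]) -> Dict[str, List[float]]:
        # Apply this node's shared layers once...
        for layer in self.layers:
            x = layer(x)
        # ...then branch: every subtree reuses the shared representation,
        # so all target languages are produced in one pass.
        if not self.children:
            return {self.language: x}
        out: Dict[str, List[float]] = {}
        for child in self.children.values():
            out.update(child.forward(list(x)))
        return out

def scale(factor: float) -> Layer:
    """Stand-in for a Transformer encoder layer (toy computation)."""
    return lambda x: [v * factor for v in x]

# Toy tree: a root shared by all languages, a branch shared by "es" and
# "pt", and a separate leaf for "de" (grouping is illustrative only).
tree = EncoderNode(
    layers=[scale(2.0)],
    children={
        "romance": EncoderNode(
            layers=[scale(3.0)],
            children={
                "es": EncoderNode([scale(1.0)], language="es"),
                "pt": EncoderNode([scale(0.5)], language="pt"),
            },
        ),
        "de": EncoderNode([scale(10.0)], language="de"),
    },
)

outputs = tree.forward([1.0])  # one forward pass -> all target languages
# es: 1*2*3*1 = 6.0, pt: 1*2*3*0.5 = 3.0, de: 1*2*10 = 20.0
```

Note how the root's computation is performed once and reused by every branch, which is where the parameter sharing and the reduction in redundant computation come from.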
📝 Abstract
Multilingual translation faces challenges of computational redundancy and limited accuracy for low-resource languages, especially in speech translation. To address this, we propose a novel hierarchical Transformer Encoder Tree (TET) combined with non-autoregressive encoder-only models trained with Connectionist Temporal Classification (CTC) for multilingual translation. By sharing intermediate representations among linguistically similar target languages, TET improves accuracy on low-resource languages, reduces computational redundancy, and generates all target languages in a single forward pass, eliminating sequential bottlenecks and improving parallelism. For speech translation, combining TET with a non-autoregressive speech recognition backbone (wav2vec 2.0) yields translation quality competitive with autoregressive systems while being 7–14 times faster.
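The speedup comes from CTC-style non-autoregressive decoding: the encoder emits a label for every frame in parallel, and the output is recovered by collapsing repeated labels and dropping blanks, with no left-to-right dependency. A minimal sketch of that collapse rule (the blank symbol and token names here are illustrative, not the paper's vocabulary):

```python
# Greedy CTC decoding rule: collapse consecutive repeats, remove blanks.
BLANK = "<b>"

def ctc_collapse(frames: list) -> list:
    """Collapse a frame-level label sequence into the output sequence."""
    out, prev = [], None
    for tok in frames:
        # Emit a token only when it differs from the previous frame's
        # label and is not the blank symbol.
        if tok != prev and tok != BLANK:
            out.append(tok)
        prev = tok
    return out

# All frame labels are produced in parallel by the encoder, so this
# post-processing is the only sequential step.
print(ctc_collapse(["<b>", "h", "h", "<b>", "o", "o", "<b>", "o"]))
# -> ['h', 'o', 'o']
```

Because no token conditions on previously generated tokens, inference cost does not grow with an autoregressive decoding loop, which is consistent with the reported 7-14x speedup.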