Memory Reviving, Continuing Learning and Beyond: Evaluation of Pre-trained Encoders and Decoders for Multimodal Machine Translation

📅 2025-04-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the poorly understood mechanisms underlying pre-trained language and vision models in multimodal machine translation (MMT). The authors systematically investigate the impact of pre-trained encoder and decoder components within a unified MMT framework. Experiments on the Multi30K and CoMMuTE benchmarks for English–German and English–French translation compare training from scratch, fine-tuning with partially frozen pre-trained components, and other strategies. The study is the first to reveal an *asymmetry* in pre-training effects: decoder pre-training consistently improves translation fluency and accuracy, whereas the benefit of encoder pre-training depends critically on image–text alignment quality. Furthermore, the authors characterize the interaction between modality fusion mechanisms and pre-trained modules, uncovering principled dependencies that inform architectural design. These findings provide empirical grounding and reusable optimization guidelines for developing effective MMT systems.

📝 Abstract
Multimodal Machine Translation (MMT) aims to improve translation quality by leveraging auxiliary modalities such as images alongside textual input. While recent advances in large-scale pre-trained language and vision models have significantly benefited unimodal natural language processing tasks, their effectiveness and role in MMT remain underexplored. In this work, we conduct a systematic study on the impact of pre-trained encoders and decoders in multimodal translation models. Specifically, we analyze how different training strategies, from training from scratch to using pre-trained and partially frozen components, affect translation performance under a unified MMT framework. Experiments are carried out on the Multi30K and CoMMuTE datasets across English-German and English-French translation tasks. Our results reveal that pre-training plays a crucial yet asymmetrical role in multimodal settings: pre-trained decoders consistently yield more fluent and accurate outputs, while pre-trained encoders show varied effects depending on the quality of visual-text alignment. Furthermore, we provide insights into the interplay between modality fusion and pre-trained components, offering guidance for future architecture design in multimodal translation systems.
Problem

Research questions and friction points this paper is trying to address.

Evaluating pre-trained encoders and decoders in multimodal machine translation.
Analyzing training strategies' impact on multimodal translation performance.
Exploring modality fusion and pre-trained components' interplay in MMT.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leveraging pre-trained encoders and decoders within a unified MMT framework
Systematically comparing training strategies, from training from scratch to fine-tuning with partially frozen pre-trained components
Analyzing the interplay between modality fusion mechanisms and pre-trained components
👥 Authors
Zhuang Yu (Department of Automation, Shanghai Jiao Tong University)
Shiliang Sun (Shanghai Jiao Tong University)
Jing Zhao (School of Computer Science and Technology, East China Normal University)
Tengfei Song (Huawei)
Hao Yang (2012 Labs, Huawei Technologies CO., LTD)