Memory Reviving, Continuing Learning and Beyond: Evaluation of Pre-trained Encoders and Decoders for Multimodal Machine Translation

📅 2025-04-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the poorly understood role of pre-trained language and vision models in multimodal machine translation (MMT). We systematically investigate the impact of pre-trained encoder and decoder components within a unified MMT framework. Experiments on the Multi30K and CoMMuTE benchmarks for English–German and English–French translation compare training strategies ranging from training from scratch to using pre-trained and partially frozen components. Our study reveals an *asymmetry* in pre-training effects: pre-trained decoders consistently yield more fluent and accurate translations, whereas the effect of pre-trained encoders varies with the quality of image–text alignment. We also examine the interplay between modality fusion and pre-trained components, offering guidance for future architecture design in MMT systems.

📝 Abstract

Multimodal Machine Translation (MMT) aims to improve translation quality by leveraging auxiliary modalities such as images alongside textual input. While recent advances in large-scale pre-trained language and vision models have significantly benefited unimodal natural language processing tasks, their effectiveness and role in MMT remain underexplored. In this work, we conduct a systematic study on the impact of pre-trained encoders and decoders in multimodal translation models. Specifically, we analyze how different training strategies, from training from scratch to using pre-trained and partially frozen components, affect translation performance under a unified MMT framework. Experiments are carried out on the Multi30K and CoMMuTE datasets across English-German and English-French translation tasks. Our results reveal that pre-training plays a crucial yet asymmetrical role in multimodal settings: pre-trained decoders consistently yield more fluent and accurate outputs, while pre-trained encoders show varied effects depending on the quality of visual-text alignment. Furthermore, we provide insights into the interplay between modality fusion and pre-trained components, offering guidance for future architecture design in multimodal translation systems.
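The training strategies described in the abstract (training from scratch, starting from pre-trained weights, and keeping pre-trained components frozen) can be pictured as a small configuration grid over the encoder and decoder. The following is a minimal sketch under stated assumptions, not the paper's implementation: the names `Init`, `Strategy`, and `strategy_grid`, the `encoder.` / `decoder.` parameter-name prefixes, and the choice to always train other modules such as a fusion layer are all hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class Init(Enum):
    """How one component is initialized and trained (hypothetical labels)."""
    SCRATCH = "scratch"        # random initialization, updated during training
    PRETRAINED = "pretrained"  # loaded from a checkpoint, then fine-tuned
    FROZEN = "frozen"          # loaded from a checkpoint, never updated


@dataclass(frozen=True)
class Strategy:
    """One encoder/decoder training configuration."""
    encoder: Init
    decoder: Init

    def trainable(self, param_name: str) -> bool:
        """Return True if the named parameter should receive gradient updates."""
        # Assumption: parameter names are prefixed by the component they belong to.
        if param_name.startswith("encoder."):
            return self.encoder is not Init.FROZEN
        if param_name.startswith("decoder."):
            return self.decoder is not Init.FROZEN
        # Assumption: anything else (e.g. a fusion layer) is new and always trained.
        return True


def strategy_grid() -> list[Strategy]:
    """Enumerate every encoder/decoder combination of the three options."""
    return [Strategy(enc, dec) for enc, dec in product(Init, Init)]


if __name__ == "__main__":
    for s in strategy_grid():
        print(f"encoder={s.encoder.value:<10} decoder={s.decoder.value}")
```

In a real training loop, a predicate like `trainable` would typically decide which parameters are handed to the optimizer. Which cells of this grid the paper actually evaluates is not stated on this page, so the full grid is an assumption.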

Problem

Research questions and friction points this paper is trying to address.

Evaluating pre-trained encoders and decoders in multimodal machine translation.

Analyzing the impact of training strategies on multimodal translation performance.

Exploring the interplay between modality fusion and pre-trained components in MMT.

Innovation

Methods, ideas, or system contributions that make the work stand out.

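Comparing pre-trained encoders and decoders within a unified MMT framework

Analyzing training strategies ranging from training from scratch to partially frozen pre-trained components

Examining how modality fusion interacts with pre-trained components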
