🤖 AI Summary
This work examines the poorly understood role of pre-trained language and vision models in multimodal machine translation (MMT). We systematically investigate the impact of pre-trained encoder and decoder components within a unified MMT framework. Experiments on the Multi30K and CoMMuTE benchmarks for English–German and English–French translation compare training strategies ranging from training from scratch to using pre-trained and partially frozen components. The study reveals an *asymmetry* in pre-training effects: pre-trained decoders consistently improve translation fluency and accuracy, whereas the effect of pre-trained encoders varies with the quality of image–text alignment. Furthermore, we characterize the interaction between modality fusion mechanisms and pre-trained modules, offering guidance for architecture design. These findings provide empirical grounding for developing effective MMT systems.
📝 Abstract
Multimodal Machine Translation (MMT) aims to improve translation quality by leveraging auxiliary modalities such as images alongside textual input. While recent advances in large-scale pre-trained language and vision models have significantly benefited unimodal natural language processing tasks, their effectiveness and role in MMT remain underexplored. In this work, we conduct a systematic study of the impact of pre-trained encoders and decoders in multimodal translation models. Specifically, we analyze how different training strategies, from training from scratch to using pre-trained and partially frozen components, affect translation performance under a unified MMT framework. Experiments are carried out on the Multi30K and CoMMuTE datasets across English-German and English-French translation tasks. Our results reveal that pre-training plays a crucial yet asymmetrical role in multimodal settings: pre-trained decoders consistently yield more fluent and accurate outputs, while pre-trained encoders show varied effects depending on the quality of visual-text alignment. Furthermore, we provide insights into the interplay between modality fusion and pre-trained components, offering guidance for future architecture design in multimodal translation systems.