🤖 AI Summary
Current multimodal machine translation (MMT) approaches take real images as input, which makes them vulnerable to visual noise and requires paired image–text data at inference; both limit robustness and practicality. To address these limitations, we propose D2P-MMT, a diffusion-to-prompt MMT framework that conditions a pretrained diffusion model on the source text to generate a semantically consistent reconstructed image, removing the dependence on real images at inference. We further design a dual-branch prompting mechanism that strengthens cross-modal interaction between the source text and the real and generated images, and introduce a distribution alignment loss that keeps the outputs of the two branches consistent, mitigating the gap between generated and real images. The proposed method improves robustness to visual noise and generalization capability. On the Multi30K benchmark, D2P-MMT surpasses state-of-the-art methods in translation quality and inference stability.
📝 Abstract
Multimodal Machine Translation (MMT) typically enhances text-only translation by incorporating aligned visual features. Despite remarkable progress, state-of-the-art MMT approaches often rely on paired image-text inputs at inference and are sensitive to irrelevant visual noise, which limits their robustness and practical applicability. To address these issues, we propose D2P-MMT, a diffusion-based dual-branch prompting framework for robust vision-guided translation. Specifically, at inference D2P-MMT requires only the source text and a reconstructed image generated by a pre-trained diffusion model; the reconstruction naturally filters out distracting visual details while preserving semantic cues. During training, the model jointly learns from both authentic and reconstructed images using a dual-branch prompting strategy, encouraging rich cross-modal interactions. To bridge the modality gap and mitigate training-inference discrepancies, we introduce a distributional alignment loss that enforces consistency between the output distributions of the two branches. Extensive experiments on the Multi30K dataset demonstrate that D2P-MMT achieves superior translation performance compared to existing state-of-the-art approaches.
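To make the distributional alignment idea concrete, the sketch below computes one plausible form of such a loss: a symmetric KL divergence between the per-token output distributions of the authentic-image branch and the reconstructed-image branch, averaged over target positions. The abstract does not state which divergence D2P-MMT uses, so the symmetric KL, the function names (`softmax`, `kl_divergence`, `alignment_loss`), and the toy logits are all assumptions made for illustration, not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def alignment_loss(real_logits, recon_logits):
    """Symmetric KL between the two branches, averaged over target positions.

    Assumption: each argument is a list with one row of vocabulary logits
    per target token, and both branches decode the same target sequence.
    """
    total = 0.0
    for r, g in zip(real_logits, recon_logits):
        p, q = softmax(r), softmax(g)
        total += 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
    return total / len(real_logits)


# Hypothetical logits for two target positions over a 3-word vocabulary.
real_branch = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
recon_branch = [[1.5, 0.7, -0.8], [0.3, 0.2, 0.1]]
loss = alignment_loss(real_branch, recon_branch)
```

In a training setup of this kind, a term like `loss` would typically be added to the translation losses of the two branches with a weighting coefficient. It is zero when both branches predict identical distributions and grows as they diverge, which is what pushes the reconstructed-image branch used at inference toward the behaviour of the authentic-image branch.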