GIIFT: Graph-guided Inductive Image-free Multimodal Machine Translation

📅 2025-07-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal machine translation (MMT) faces two key bottlenecks: rigid inter-modal alignment impedes thorough vision–language fusion, and existing models require images at inference and remain confined to their training domains, so they fail to generalize to image-free scenarios. This paper proposes GIIFT, a two-stage graph-guided inductive image-free MMT framework. Its core innovations are: (1) constructing multimodal scene graphs that explicitly model structured semantic relationships between vision and language while preserving modality-specific characteristics; and (2) a cross-modal Graph Attention Network adapter that enables inductive knowledge transfer from image-augmented training to image-free inference within a unified fusion space. On the Multi30K English–French and English–German benchmarks, GIIFT achieves state-of-the-art performance for image-free MMT; on WMT benchmarks, it significantly outperforms image-free translation baselines, demonstrating high-quality, generalizable translation with no dependence on images at inference.

📝 Abstract
Multimodal Machine Translation (MMT) has demonstrated that visual information can significantly aid machine translation. However, existing MMT methods struggle to bridge the modality gap, enforcing rigid visual-linguistic alignment while being confined to inference within their trained multimodal domains. In this work, we construct novel multimodal scene graphs to preserve and integrate modality-specific information, and we introduce GIIFT, a two-stage Graph-guided Inductive Image-Free MMT framework that uses a cross-modal Graph Attention Network adapter to learn multimodal knowledge in a unified fused space and inductively generalize it to broader image-free translation domains. Experimental results on the English-to-French and English-to-German tasks of the Multi30K dataset show that GIIFT surpasses existing approaches and achieves the state of the art, even without images during inference. Results on the WMT benchmark show significant improvements over image-free translation baselines, demonstrating the strength of GIIFT for inductive image-free inference.
Problem

Research questions and friction points this paper is trying to address.

Bridging modality gap in multimodal machine translation
Generalizing translation beyond trained multimodal domains
Achieving image-free translation with inductive learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Constructs multimodal scene graphs that preserve and integrate modality-specific information
Uses a cross-modal Graph Attention Network adapter to fuse modalities in a unified space
Inductively generalizes multimodal knowledge to image-free translation domains
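To make the adapter idea concrete: a Graph Attention Network layer computes attention-weighted aggregations over a node's neighbours, so text nodes and image-region nodes connected in a multimodal scene graph can exchange information. The sketch below is a minimal single-head numpy illustration of such a layer applied to a toy scene graph; it is not the authors' implementation, and the class, parameter, and variable names are hypothetical.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class GraphAttentionLayer:
    """Single-head graph attention layer (GAT-style).

    Hypothetical sketch of the kind of cross-modal adapter GIIFT
    describes; weights are randomly initialized, not trained.
    """
    def __init__(self, in_dim, out_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((in_dim, out_dim)) * 0.1  # shared projection
        self.a = rng.standard_normal(2 * out_dim) * 0.1        # attention vector

    def forward(self, H, adj):
        """H: (N, in_dim) node features; adj: (N, N) adjacency, 1 = edge."""
        Z = H @ self.W                       # project all nodes: (N, out_dim)
        out = np.zeros_like(Z)
        for i in range(Z.shape[0]):
            nbrs = np.flatnonzero(adj[i])    # neighbours of node i (incl. self-loop)
            logits = np.array([
                leaky_relu(self.a @ np.concatenate([Z[i], Z[j]]))
                for j in nbrs
            ])
            alpha = softmax(logits)          # attention weights sum to 1
            out[i] = alpha @ Z[nbrs]         # weighted neighbour aggregation
        return out

# Toy multimodal scene graph: nodes 0-2 are text nodes, 3-4 are image
# regions; self-loops ensure every node attends to at least itself.
rng = np.random.default_rng(1)
H = rng.standard_normal((5, 8))
adj = np.eye(5)
adj[0, 3] = adj[3, 0] = 1   # text node 0 <-> image region 3
adj[1, 4] = adj[4, 1] = 1   # text node 1 <-> image region 4
layer = GraphAttentionLayer(in_dim=8, out_dim=4)
fused = layer.forward(H, adj)
```

At inference without images, the same layer would simply run over a graph containing only text nodes, which is one plausible reading of how graph attention supports image-free generalization.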