AI Summary
This study investigates the impact of visual context on noise-robust multimodal neural machine translation (NMT) from English to Hindi, Bengali, and Malayalam. Addressing translation degradation under textual noise, we propose a visual feature fusion method built upon pretrained unimodal NMT architectures, systematically injecting multi-level synthetic noise and comparing full-image versus cropped-region visual inputs. Key contributions: (1) Visual signals significantly enhance robustness under noise, yet their benefit is not strictly contingent on semantic alignment: random images also improve performance, indicating a noise-modulation mechanism rather than conventional grounding; (2) We introduce the first noise-adaptive visual feature selection strategy; (3) Our approach achieves new state-of-the-art results across all three language pairs, with up to +2.1 BLEU gains under noisy conditions, demonstrating the substantive value of multimodal modeling for low-resource Indian language translation.
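The multi-level synthetic noise mentioned above can be pictured with a small sketch. The paper's exact noise types and levels are not given here, so the character-level operations and the per-token corruption probability below are illustrative assumptions, not the authors' method:

```python
import random


def inject_noise(tokens, noise_level, rng=None):
    """Corrupt each token with probability `noise_level` (hypothetical sketch).

    Applies one of three simple character-level corruptions: dropping,
    swapping, or repeating a character. Varying `noise_level` yields the
    multi-level noisy inputs used to probe translation robustness.
    """
    rng = rng or random.Random(0)

    def corrupt(tok):
        if len(tok) < 2:  # too short to corrupt meaningfully
            return tok
        i = rng.randrange(len(tok) - 1)
        op = rng.choice(["drop", "swap", "repeat"])
        if op == "drop":
            return tok[:i] + tok[i + 1:]
        if op == "swap":
            return tok[:i] + tok[i + 1] + tok[i] + tok[i + 2:]
        return tok[:i + 1] + tok[i] + tok[i + 1:]  # repeat char i

    return [corrupt(t) if rng.random() < noise_level else t for t in tokens]
```

With `noise_level=0.0` the sentence is left untouched; higher levels corrupt progressively more tokens, giving a controllable noise dial for evaluation.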
Abstract
The study investigates the effectiveness of utilizing multimodal information in Neural Machine Translation (NMT). While prior research focused on using multimodal data in low-resource scenarios, this study examines how image features affect translation when added to a large-scale, pre-trained unimodal NMT system. Surprisingly, the study finds that images may be redundant in this setting. The research additionally introduces synthetic noise to assess whether images help the model cope with textual noise; multimodal models slightly outperform text-only models in noisy settings, even when paired with random images. Experiments on English-to-Hindi, English-to-Bengali, and English-to-Malayalam translation significantly outperform state-of-the-art benchmarks. Interestingly, the optimal visual context varies with the noise in the source text: no visual context works best for non-noisy translations, cropped image features are optimal under low noise, and full image features work better in high-noise scenarios. This sheds light on the role of visual context, especially in noisy settings, and opens a new research direction for noisy NMT in multimodal setups. The research emphasizes the importance of combining visual and textual information for robust translation across varied environments.
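The noise-dependent behavior described above (no visual context for clean text, cropped regions for low noise, full images for high noise) suggests a simple selection rule. The thresholds and the function name below are hypothetical assumptions for illustration; the paper reports the pattern empirically rather than prescribing specific cut-offs:

```python
def select_visual_context(noise_level, full_feats, crop_feats,
                          low=0.1, high=0.4):
    """Pick visual features based on estimated source-text noise.

    Thresholds `low` and `high` are illustrative, not from the paper.
    Returns None (text-only), cropped-region features, or full-image
    features, mirroring the reported best setting at each noise level.
    """
    if noise_level < low:
        return None          # clean text: visual context is redundant
    if noise_level < high:
        return crop_feats    # low noise: object-level cropped regions
    return full_feats        # high noise: global full-image features
```

In a real system the noise level would itself have to be estimated from the input (e.g. via a language-model perplexity or spell-check score), which is an open problem this selection rule deliberately abstracts away.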