AI Summary
This study investigates the impact of visual context on noise-robust multimodal neural machine translation (NMT) from English to Hindi, Bengali, and Malayalam. Addressing translation degradation under textual noise, we propose a visual feature fusion method built upon pretrained unimodal NMT architectures, systematically injecting multi-level synthetic noise and comparing full-image versus cropped-region visual inputs. Key contributions: (1) Visual signals significantly enhance robustness under noise, yet their benefit is not strictly contingent on semantic alignment: random images also improve performance, indicating a noise-modulation mechanism rather than conventional grounding; (2) We introduce the first noise-adaptive visual feature selection strategy; (3) Our approach achieves new state-of-the-art results across all three language pairs, with up to +2.1 BLEU gains under noisy conditions, demonstrating the substantive value of multimodal modeling for low-resource Indian language translation.
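The multi-level synthetic noise mentioned above can be pictured with a small sketch. The paper's exact noise types and levels are not given here, so the character-level operations and the per-token corruption probability below are illustrative assumptions, not the authors' method:

```python
import random


def inject_noise(tokens, noise_level, rng=None):
    """Corrupt each token with probability `noise_level` (hypothetical sketch).

    Applies one of three simple character-level corruptions: dropping,
    swapping, or repeating a character. Varying `noise_level` yields the
    multi-level noisy inputs used to probe translation robustness.
    """
    rng = rng or random.Random(0)

    def corrupt(tok):
        if len(tok) < 2:  # too short to corrupt meaningfully
            return tok
        i = rng.randrange(len(tok) - 1)
        op = rng.choice(["drop", "swap", "repeat"])
        if op == "drop":
            return tok[:i] + tok[i + 1:]
        if op == "swap":
            return tok[:i] + tok[i + 1] + tok[i] + tok[i + 2:]
        return tok[:i + 1] + tok[i] + tok[i + 1:]  # repeat char i

    return [corrupt(t) if rng.random() < noise_level else t for t in tokens]
```

With `noise_level=0.0` the sentence is left untouched; higher levels corrupt progressively more tokens, giving a controllable noise dial for evaluation.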
Abstract
The study investigates the effectiveness of utilizing multimodal information in Neural Machine Translation (NMT). While prior research focused on using multimodal data in low-resource scenarios, this study examines how image features affect translation when added to a large-scale, pre-trained unimodal NMT system. Surprisingly, the study finds that images may be redundant in this setting. The research additionally introduces synthetic noise to assess whether images help the model cope with textual noise; multimodal models slightly outperform text-only models in noisy settings, even when paired with random images. Experiments on English-to-Hindi, English-to-Bengali, and English-to-Malayalam translation significantly outperform state-of-the-art benchmarks. Interestingly, the optimal visual context varies with the noise in the source text: no visual context works best for non-noisy translations, cropped image features are optimal under low noise, and full image features work better in high-noise scenarios. This sheds light on the role of visual context, especially in noisy settings, and opens a new research direction for noisy NMT in multimodal setups. The research emphasizes the importance of combining visual and textual information for robust translation across varied environments.
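The noise-dependent behavior described above (no visual context for clean text, cropped regions for low noise, full images for high noise) suggests a simple selection rule. The thresholds and the function name below are hypothetical assumptions for illustration; the paper reports the pattern empirically rather than prescribing specific cut-offs:

```python
def select_visual_context(noise_level, full_feats, crop_feats,
                          low=0.1, high=0.4):
    """Pick visual features based on estimated source-text noise.

    Thresholds `low` and `high` are illustrative, not from the paper.
    Returns None (text-only), cropped-region features, or full-image
    features, mirroring the reported best setting at each noise level.
    """
    if noise_level < low:
        return None          # clean text: visual context is redundant
    if noise_level < high:
        return crop_feats    # low noise: object-level cropped regions
    return full_feats        # high noise: global full-image features
```

In a real system the noise level would itself have to be estimated from the input (e.g. via a language-model perplexity or spell-check score), which is an open problem this selection rule deliberately abstracts away.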