VaaWIT: Visual-Aware Adaptation of Large Language Models for Multilingual Web Image Translation

📅 2026-05-23
🤖 AI Summary
This work addresses the weak performance of existing large vision-language models when translating multilingual text in web images, a limitation the authors attribute to standard visual encoders favoring high-level semantics over fine-grained visual detail. To overcome this, the authors propose VaaWIT, an end-to-end framework built on two components. A Dual-Stream Attention Module (DSAM) lets multilingual semantic features and detailed visual representations attend to each other and fuses them into a unified representation. A Visual-Aware Adapter (VAA) then injects these fused visual cues into a large language model whose backbone parameters stay frozen, which keeps fine-tuning costs low. On eight tasks across three public benchmarks, VaaWIT significantly outperforms open-source state-of-the-art baselines and is competitive with proprietary models.
📝 Abstract
Translating text embedded in Web images is crucial for improving content accessibility and cross-lingual information retrieval, particularly within social media and e-commerce domains. Although Large Vision-Language Models (LVLMs) have advanced multimodal understanding, applying them to Web image translation remains challenging due to the visual representation gap: standard encoders often prioritize high-level semantics over the fine-grained visual details required for recognizing diverse character morphologies. To address this challenge, we propose VaaWIT, an end-to-end framework that adapts Large Language Models (LLMs) for multilingual Web image translation. The framework introduces two key technical contributions: (1) a Dual-Stream Attention Module (DSAM), which facilitates bidirectional interaction between multilingual semantic features and detailed visual representations, thereby synthesizing unified features robust to textual variations; and (2) a Visual-Aware Adapter (VAA), a parameter-efficient fine-tuning strategy that dynamically injects these fused visual cues into the frozen LLM backbone. This design enables the model to align the visual context with linguistic reasoning effectively while minimizing computational costs. Extensive experiments on eight tasks across three public benchmarks demonstrate that VaaWIT significantly outperforms state-of-the-art (SOTA) open-source baselines and achieves competitive performance against proprietary models. These results validate the efficacy of integrating fine-grained visual perception into LLMs for complex Web content analysis.
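The abstract names the two components but gives no architectural detail, so the following is only a minimal NumPy sketch of the general idea, not the authors' implementation. It assumes that "bidirectional interaction" means two cross-attention passes (text attending to vision and vision attending to text) and that the adapter is a low-rank bottleneck with a scalar gate. Every name here (`cross_attention`, `dual_stream_fusion`, `adapter_inject`), the concatenation step, the mean pooling, and the gate are hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys_values, w_q, w_k, w_v):
    """Single-head scaled dot-product attention from one stream onto another."""
    q = queries @ w_q
    k = keys_values @ w_k
    v = keys_values @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def dual_stream_fusion(text_feats, vis_feats, params):
    """Assumed DSAM-like step: each stream attends to the other (residual),
    then both streams are concatenated along the token axis."""
    text_out = text_feats + cross_attention(text_feats, vis_feats, *params["t2v"])
    vis_out = vis_feats + cross_attention(vis_feats, text_feats, *params["v2t"])
    return np.concatenate([text_out, vis_out], axis=0)


def adapter_inject(hidden, fused, w_down, w_up, gate):
    """Assumed VAA-like step: a trainable low-rank bottleneck adds a visual
    correction to hidden states produced by a frozen LLM layer."""
    context = fused.mean(axis=0)                     # pooled fused cue, shape (d,)
    delta = np.tanh((hidden + context) @ w_down) @ w_up
    return hidden + gate * delta                     # gate = 0 leaves the LLM output untouched


# Toy usage with random features; all sizes are arbitrary.
rng = np.random.default_rng(0)
d, r = 16, 4                                         # hidden size, adapter rank
text = rng.normal(size=(5, d))                       # 5 text tokens
vis = rng.normal(size=(9, d))                        # 9 visual patches
params = {
    "t2v": tuple(rng.normal(scale=0.1, size=(d, d)) for _ in range(3)),
    "v2t": tuple(rng.normal(scale=0.1, size=(d, d)) for _ in range(3)),
}
fused = dual_stream_fusion(text, vis, params)

hidden = rng.normal(size=(7, d))                     # stand-in for frozen LLM hidden states
w_down = rng.normal(scale=0.1, size=(d, r))
w_up = rng.normal(scale=0.1, size=(r, d))
out_closed = adapter_inject(hidden, fused, w_down, w_up, gate=0.0)
out_open = adapter_inject(hidden, fused, w_down, w_up, gate=1.0)
```

In this sketch only `params`, `w_down`, `w_up`, and `gate` would be trained, which is what makes the approach parameter-efficient: with rank `r` much smaller than `d`, the adapter adds roughly `2 * d * r` weights per injection point while the backbone stays fixed.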
Problem

Research questions and friction points this paper is trying to address.

Web image translation
multilingual translation
visual representation gap
character morphology
Large Vision-Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-Stream Attention Module
Visual-Aware Adapter
Large Language Model Adaptation
Multilingual Web Image Translation
Parameter-Efficient Fine-Tuning