Chitranuvad: Adapting Multi-lingual LLMs for Multimodal Translation

📅 2025-02-27
🏛️ Conference on Machine Translation
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses multimodal translation from English to low-resource Indian languages (Hindi, Bengali, Malayalam). We propose a unified vision-language framework that jointly leverages textual and visual inputs. Methodologically, we pair a Vision Transformer (ViT) image encoder with lightweight learnable linear adapters that project visual tokens into the hidden space of multilingual large language models (e.g., mBART, Qwen-MoE), enabling end-to-end joint training on three tasks: image captioning, text-only translation, and multimodal translation. The design balances parameter efficiency with cross-modal semantic consistency. On the WAT2024 benchmark, our approach achieves state-of-the-art performance across all three tasks for Hindi and remains competitive for Bengali and Malayalam, demonstrating strong generalization and practical efficacy in low-resource settings.
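The adapter described above can be sketched as a learned linear map from ViT patch features into the LLM's embedding space, with the projected visual tokens prefixed to the text embeddings fed to the decoder. The sketch below uses NumPy stand-ins for the real encoders; all dimensions (`VIT_DIM`, `LLM_DIM`, `N_PATCHES`) and the function name `project_visual_tokens` are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

# Hypothetical dimensions (not taken from the paper).
VIT_DIM = 768      # size of ViT patch embeddings
LLM_DIM = 1024     # hidden size of the multilingual LLM
N_PATCHES = 196    # 14x14 patches for a 224x224 input image

rng = np.random.default_rng(0)

# Learnable linear adapter: weight and bias projecting visual tokens
# into the LLM embedding space. In training these would be updated
# jointly with (or instead of) the backbone parameters.
W_adapter = rng.standard_normal((VIT_DIM, LLM_DIM)) * 0.02
b_adapter = np.zeros(LLM_DIM)

def project_visual_tokens(vit_features: np.ndarray) -> np.ndarray:
    """Map ViT patch features (n_patches, VIT_DIM) to (n_patches, LLM_DIM)."""
    return vit_features @ W_adapter + b_adapter

# Dummy encoder outputs standing in for a real ViT and a real tokenizer.
vit_features = rng.standard_normal((N_PATCHES, VIT_DIM))
text_embeddings = rng.standard_normal((12, LLM_DIM))  # embedded English source tokens

visual_tokens = project_visual_tokens(vit_features)

# Prefix the projected visual tokens to the text sequence; the LLM then
# generates the translation autoregressively from this joint input.
llm_input = np.concatenate([visual_tokens, text_embeddings], axis=0)
print(llm_input.shape)  # (208, 1024)
```

Because the adapter is a single linear layer, it adds only VIT_DIM x LLM_DIM + LLM_DIM parameters, which is what makes the approach parameter-efficient relative to fine-tuning the full vision encoder.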

📝 Abstract
In this work, we provide the system description of our submission to the English to Lowres Multimodal Translation Task at the Workshop on Asian Translation (WAT2024). We introduce Chitranuvad, a multimodal model that effectively integrates a multilingual LLM and a vision module for multimodal translation. Our method uses a ViT image encoder to extract visual representations as visual token embeddings, which are projected into the LLM space by an adapter layer; the translation is then generated in an autoregressive fashion. We participated in all three tracks (Image Captioning, Text-only, and Multimodal Translation) for Indic languages (i.e., English translation to Hindi, Bengali, and Malayalam) and achieved SOTA results for Hindi in all of them on the Challenge set, while remaining competitive for the other languages in the shared task.
Problem

Research questions and friction points this paper is trying to address.

Adapting multilingual LLMs for multimodal translation tasks.
Integrating vision modules with LLMs for enhanced translation accuracy.
Achieving state-of-the-art results in Indic language translations.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates Multilingual LLM with vision module
Uses ViT encoder for visual token embeddings
Achieves SOTA in Hindi multimodal translation
Shaharukh Khan
Unknown affiliation
Machine Learning, VLM
Ayush K Tarun
Krutrim AI, Bangalore, India
Ali Faraz
Data Scientist, Krutrim
Machine Learning, LLMs, LVMs, Computer Vision
Palash Kamble
Krutrim AI, Bangalore, India
Vivek Dahiya
Krutrim AI, Bangalore, India
Praveen Pokala
Krutrim AI, Bangalore, India
Ashish Kulkarni
Krutrim
Artificial Intelligence, Machine Learning, Natural Language Processing
Chandra Khatri
Ola Krutrim AI
Artificial Intelligence, Multi-Modal AI, Conversational AI, Deep Learning, Machine Learning
Abhinav Ravi
Krutrim AI, Bangalore, India
Shubham Agarwal
Krutrim AI, Bangalore, India