Multilingual and Multimodal LLMs in the Wild: Building for Low-Resource Languages

📅 2026-05-16

🤖 AI Summary
This tutorial addresses the limitations of current multimodal foundation models and evaluation benchmarks, which are predominantly English-centric and computationally expensive and are therefore hard to apply to low-resource languages. It surveys multilingual multimodal modeling across text, speech, and vision under constrained data and compute budgets, covering low-cost data creation and curation, adapter stacks for tri-modal alignment, and culture-aware evaluation beyond English. Hands-on resources cover fine-tuning a compact multilingual VLM and wiring a speech->text->LLM pipeline.
📝 Abstract
Multimodal LLMs are evolving from vision-language models into tri-modal systems that see, hear, and read, yet pipelines and benchmarks remain English-centric and compute-heavy. This tutorial offers an overview of the emerging research area of multilingual multimodality across text, speech, and vision under limited data and compute budgets, synthesizing foundations, recent multilingual models (PALO, Maya), and speech-text LLMs. We cover low-cost data creation and curation; adapter stacks for tri-modal alignment; culture-aware evaluation beyond English; and hands-on resources for fine-tuning a compact multilingual VLM and wiring a speech->text->LLM pipeline. The content will be delivered as an interactive half-day tutorial designed for researchers and practitioners working on multilingual, multimodal AI in low-resource language settings.
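The speech->text->LLM wiring mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not describe the tutorial's actual pipeline, so everything here is an assumption: the class name `SpeechTextLLMPipeline`, the prompt wording, and the two stub functions `fake_asr` and `echo_llm`, which stand in for a real speech recognizer and a real LLM.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SpeechTextLLMPipeline:
    """Hypothetical cascade: audio -> transcript -> LLM response."""

    # (audio bytes, language code) -> transcript; placeholder for a real ASR model
    transcribe: Callable[[bytes, str], str]
    # prompt -> response; placeholder for a real LLM call
    generate: Callable[[str], str]

    def run(self, audio: bytes, language: str) -> str:
        transcript = self.transcribe(audio, language).strip()
        if not transcript:
            # Fail early instead of sending an empty prompt downstream.
            raise ValueError("empty transcript")
        # Tell the LLM which language to answer in, then pass the transcript.
        prompt = (
            f"Answer in the language with code '{language}'.\n"
            f"User: {transcript}"
        )
        return self.generate(prompt)


# Stub components (assumptions, for illustration only).
def fake_asr(audio: bytes, language: str) -> str:
    return "habari ya asubuhi"


def echo_llm(prompt: str) -> str:
    # Returns the last prompt line so the data flow is easy to inspect.
    return prompt.splitlines()[-1]


pipeline = SpeechTextLLMPipeline(transcribe=fake_asr, generate=echo_llm)
reply = pipeline.run(b"\x00\x01", "sw")
print(reply)
```

Keeping the recognizer and the LLM behind plain callables means either stage can be swapped for a smaller or language-specific model without touching the wiring, which matters when the compute budget is tight.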
Problem

Research questions and friction points this paper is trying to address.

low-resource languages
multilingual multimodal LLMs
tri-modality
compute-efficient AI
non-English evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

multilingual multimodal LLMs
low-resource languages
tri-modal alignment
adapter stacks
culture-aware evaluation
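
As a rough illustration of what an adapter stack for tri-modal alignment might look like, the sketch below uses a generic residual bottleneck adapter in pure Python. The source does not specify the adapter design, so this is a common pattern rather than the tutorial's method; `BottleneckAdapter`, `apply_stack`, the dimensions, and the zero-initialized up-projection are all assumptions.

```python
import random


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


class BottleneckAdapter:
    """Hypothetical residual adapter: x + up(relu(down(x)))."""

    def __init__(self, dim, bottleneck, seed=0):
        rng = random.Random(seed)
        # Down-projection: small random weights, shape (bottleneck, dim).
        self.down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                     for _ in range(bottleneck)]
        # Up-projection starts at zero, so an untrained adapter is an identity map.
        self.up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, x):
        hidden = [max(0.0, v) for v in matvec(self.down, x)]  # ReLU
        delta = matvec(self.up, hidden)
        return [xi + di for xi, di in zip(x, delta)]  # residual connection


def apply_stack(adapters, x):
    """Run features through a list of adapters in order."""
    for adapter in adapters:
        x = adapter(x)
    return x


# One small stack per non-text modality; the encoders themselves are not shown.
stacks = {
    "vision": [BottleneckAdapter(4, 2, seed=1), BottleneckAdapter(4, 2, seed=2)],
    "speech": [BottleneckAdapter(4, 2, seed=3)],
}

features = [0.5, -1.0, 2.0, 0.0]  # stand-in for encoder output
aligned = apply_stack(stacks["vision"], features)
print(aligned == features)  # untrained adapters leave the features unchanged
```

Only the small `down` and `up` matrices would be trained in such a setup, while the modality encoders and the LLM stay frozen; that is the usual reason adapters are attractive under limited compute.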