Multilingual and Multimodal LLMs in the Wild: Building for Low-Resource Languages

📅 2026-05-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This tutorial addresses a limitation of current vision–language foundation models and evaluation benchmarks: they are predominantly English-centric and compute-heavy, which hinders their use for low-resource languages in multimodal settings. It surveys multilingual multimodal modeling across text, speech, and vision under limited data and compute budgets, covering low-cost data creation and curation, adapter stacks for tri-modal alignment, and culture-aware evaluation beyond English. Hands-on resources cover fine-tuning a compact multilingual VLM and wiring a speech-to-text-to-LLM pipeline.

📝 Abstract

Multimodal LLMs are evolving from vision-language models to tri-modal systems that see, hear, and read, yet pipelines and benchmarks remain English-centric and compute-heavy. This tutorial offers an overview of the emerging research area of multilingual multimodality across text, speech, and vision under limited data and compute budgets, synthesizing foundations, recent multilingual models (PALO, Maya), and speech-text LLMs. We cover low-cost data creation and curation; adapter stacks for tri-modal alignment; culture-aware evaluation beyond English; and hands-on resources for fine-tuning a compact multilingual VLM and wiring a speech->text->LLM pipeline. The content will be delivered as an interactive half-day tutorial designed for researchers and practitioners working on multilingual, multimodal AI in low-resource language settings.
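The speech->text->LLM wiring mentioned in the abstract can be pictured as two swappable stages joined by a prompt template. The sketch below is only an illustration under stated assumptions: the class name `SpeechTextLLMPipeline`, the stage signatures, the prompt template, and the stand-in functions `fake_asr` and `fake_llm` are all hypothetical and are not taken from the tutorial materials. In practice each stand-in would be replaced by a real speech recognizer and a real language model.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures (assumptions, not from the tutorial):
# the ASR stage maps (audio bytes, language code) to a transcript,
# and the LLM stage maps a prompt string to a reply string.
Transcriber = Callable[[bytes, str], str]
Generator = Callable[[str], str]


@dataclass
class SpeechTextLLMPipeline:
    transcribe: Transcriber
    generate: Generator
    # Illustrative template; a real system would tune this per language.
    prompt_template: str = "Answer in {lang}.\nUser said: {transcript}\n"

    def run(self, audio: bytes, lang: str) -> dict:
        transcript = self.transcribe(audio, lang).strip()
        if not transcript:
            # Empty ASR output: skip the LLM call rather than prompt on nothing.
            return {"transcript": "", "prompt": "", "reply": ""}
        prompt = self.prompt_template.format(lang=lang, transcript=transcript)
        return {
            "transcript": transcript,
            "prompt": prompt,
            "reply": self.generate(prompt),
        }


def fake_asr(audio: bytes, lang: str) -> str:
    # Stand-in recognizer: pretends the audio bytes are UTF-8 text.
    return audio.decode("utf-8")


def fake_llm(prompt: str) -> str:
    # Stand-in model: reports the prompt length instead of generating text.
    return f"[reply to {len(prompt)} chars]"


pipeline = SpeechTextLLMPipeline(transcribe=fake_asr, generate=fake_llm)
result = pipeline.run("habari ya asubuhi".encode("utf-8"), lang="sw")
print(result["transcript"])
print(result["reply"])
```

Keeping the stages as plain callables means a low-resource-language recognizer or a smaller LLM can be swapped in without touching the glue code, which is one plausible way to keep such a pipeline cheap to experiment with.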

Problem

Research questions and friction points this paper is trying to address.

low-resource languages

multilingual multimodal LLMs

tri-modality

compute-efficient AI

non-English evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

multilingual multimodal LLMs

low-resource languages

tri-modal alignment

adapter stacks

culture-aware evaluation

🔎 Similar Papers

A Survey on Large Language Models with Multilingualism: Recent Advances and New Frontiers

2024-05-17 · arXiv.org · Citations: 10