🤖 AI Summary
This tutorial addresses a limitation of current multimodal LLM pipelines and evaluation benchmarks: they are predominantly English-centric and compute-heavy, which hinders their use for low-resource languages. It surveys multilingual multimodality across text, speech, and vision under limited data and compute budgets, covering foundations, recent multilingual models (PALO, Maya), and speech-text LLMs. Topics include low-cost data creation and curation, adapter stacks for tri-modal alignment, and culture-aware evaluation beyond English. Hands-on resources show how to fine-tune a compact multilingual VLM and wire a speech->text->LLM pipeline.
📝 Abstract
Multimodal LLMs are evolving from vision-language models to tri-modal systems that see, hear, and read, yet pipelines and benchmarks remain English-centric and compute-heavy. This tutorial offers an overview of the emerging research area of multilingual multimodality across text, speech, and vision under limited data and compute budgets, synthesizing foundations, recent multilingual models (PALO, Maya), and speech-text LLMs. We cover low-cost data creation and curation; adapter stacks for tri-modal alignment; culture-aware evaluation beyond English; and hands-on resources for fine-tuning a compact multilingual VLM and wiring a speech->text->LLM pipeline. The content will be delivered as an interactive half-day tutorial designed for researchers and practitioners working on multilingual, multimodal AI in low-resource language settings.