🤖 AI Summary
High-performing open-source multimodal large language models (MLLMs) remain scarce for low-resource languages such as Basque.
Method: This work proposes a lightweight, data-driven paradigm: automatically constructing a high-quality Basque image–text dataset and performing end-to-end multimodal training on mixed-language data, using Llama-3.1-Instruct and the Basque language model Latxa as backbones, without Basque-specific instruction tuning.
Contribution/Results: We find that only ~20% Basque multimodal data suffices to achieve substantial performance gains, challenging the prevailing assumption that extensive language-specific supervision is required. The resulting model sets a new open-source state of the art on Basque multimodal understanding tasks. All components, including the curated dataset, training code, and model checkpoints, are fully open-sourced. This work provides a reproducible, transferable methodology and an empirical foundation for multimodal research in low-resource languages.
📝 Abstract
Current Multimodal Large Language Models (MLLMs) exhibit very strong performance on several demanding tasks. While commercial MLLMs deliver acceptable performance in low-resource languages, comparable results remain unattained within the open science community. In this paper, we aim to develop a strong MLLM for a low-resource language, namely Basque. For that purpose, we develop our own training and evaluation image–text datasets. Using two different Large Language Models as backbones, the Llama-3.1-Instruct model and a Basque-adapted variant called Latxa, we explore several data mixtures for training. We show that: i) low ratios of Basque multimodal data (around 20%) are already enough to obtain solid results on Basque benchmarks, and ii) contrary to expectations, a Basque-instructed backbone LLM is not required to obtain a strong MLLM in Basque. Our results pave the way for developing MLLMs for other low-resource languages, and we openly release all of our resources.