Bootstrapping Embeddings for Low Resource Languages

📅 2026-03-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of training effective embedding models for low-resource languages, which suffer from a scarcity of high-quality supervised data. To overcome this limitation, the authors propose leveraging large language models to generate synthetic triplets and introduce two novel techniques—Adapter Fusion and Cross-lingual Low-Rank Adaptation (XL-LoRA)—to enhance embedding model performance. Adapter Fusion improves generalization by integrating multilingual adapters, while XL-LoRA enables efficient cross-lingual knowledge transfer through low-rank adaptation. Experimental results demonstrate that the proposed approach significantly outperforms existing baselines across multiple low-resource languages and downstream tasks, offering a scalable and high-performing solution for embedding model development in data-scarce linguistic settings.
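The paper itself does not ship code, but the low-rank adaptation idea underlying XL-LoRA and adapter composition can be sketched minimally. The snippet below is an illustration only, not the authors' implementation: it shows a frozen base weight matrix receiving two composed LoRA updates (e.g. a hypothetical task adapter plus a language adapter), with the rank, dimensions, and adapter names all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2  # hidden size and LoRA rank (illustrative values)

# Frozen base weight of one linear layer.
W = rng.standard_normal((d, d))

def lora_delta(A, B, alpha=1.0):
    """Low-rank LoRA update: (alpha / r) * B @ A, with A (r x d) and B (d x r)."""
    return (alpha / A.shape[0]) * (B @ A)

# Two hypothetical adapters; B is zero-initialised as in standard LoRA,
# so training starts from the unmodified base model.
A_task, B_task = rng.standard_normal((r, d)), np.zeros((d, r))
A_lang, B_lang = rng.standard_normal((r, d)), np.zeros((d, r))

# Composition: apply both low-rank updates on top of the frozen base.
W_composed = W + lora_delta(A_task, B_task) + lora_delta(A_lang, B_lang)
```

Because only the small `A`/`B` factors are trained, swapping or summing adapters is cheap, which is what makes this style of composition attractive for transferring knowledge across languages.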

📝 Abstract
Embedding models are crucial to modern NLP. However, the creation of the most effective models relies on carefully constructed supervised finetuning data. For high-resource languages, such as English, such datasets are readily available; for hundreds of other languages, they are simply non-existent. We investigate whether the advent of large language models can help to bridge this gap. We test three different strategies for generating the synthetic triplet data used to optimise embedding models: in-context learning, as well as two novel approaches leveraging adapter composition and cross-lingual finetuning of the LLM generator (XL-LoRA), respectively. We find that while in-context learning still falls short of strong non-synthetic baselines, adapter composition and XL-LoRA yield strong performance gains across a wide array of tasks and languages, offering a clear, scalable pathway to producing performant embedding models for a wide variety of languages.
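The triplets mentioned in the abstract are (query, positive, negative) examples used to finetune the embedding model. As a rough sketch of how such data is consumed, the toy function below computes a standard margin-based triplet loss over cosine similarities; the embeddings, margin value, and function names are all illustrative, not taken from the paper.

```python
import numpy as np

def l2_normalize(x):
    """Normalise each row to unit length so dot products become cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def triplet_loss(q, pos, neg, margin=0.2):
    """Margin loss: push sim(q, pos) above sim(q, neg) by at least `margin`."""
    q, pos, neg = map(l2_normalize, (q, pos, neg))
    sim_pos = np.sum(q * pos, axis=-1)
    sim_neg = np.sum(q * neg, axis=-1)
    return np.maximum(0.0, margin - sim_pos + sim_neg).mean()

# One toy synthetic triplet: the positive is near the query, the negative is not.
q   = np.array([[1.0, 0.0]])
pos = np.array([[0.9, 0.1]])
neg = np.array([[0.0, 1.0]])

loss = triplet_loss(q, pos, neg)
```

In the setting the paper describes, an LLM would generate the query/positive/negative texts, an encoder would embed them, and a loss of this general shape would drive the finetuning.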
Problem

Research questions and friction points this paper is trying to address.

low-resource languages
embedding models
supervised finetuning data
synthetic data
cross-lingual
Innovation

Methods, ideas, or system contributions that make the work stand out.

low-resource languages
embedding models
synthetic data generation
adapter composition
cross-lingual finetuning
Merve Basoz
School of Informatics, University of Edinburgh, UK

Andrew Horne
Edina, University of Edinburgh, UK

Mattia Opper
PhD Student, ILCC, Edinburgh University
Machine Learning · NLP · Graph ML · Structure Induction · Representation Learning