Bootstrapping Embeddings for Low Resource Languages

📅 2026-03-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of training effective embedding models for low-resource languages, which suffer from a scarcity of high-quality supervised data. To overcome this limitation, the authors propose leveraging large language models to generate synthetic triplets and introduce two novel techniques—Adapter Fusion and Cross-lingual Low-Rank Adaptation (XL-LoRA)—to enhance embedding model performance. Adapter Fusion improves generalization by integrating multilingual adapters, while XL-LoRA enables efficient cross-lingual knowledge transfer through low-rank adaptation. Experimental results demonstrate that the proposed approach significantly outperforms existing baselines across multiple low-resource languages and downstream tasks, offering a scalable and high-performing solution for embedding model development in data-scarce linguistic settings.
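The paper itself does not ship code, but the low-rank adaptation idea underlying XL-LoRA and adapter composition can be sketched minimally. The snippet below is an illustration only, not the authors' implementation: it shows a frozen base weight matrix receiving two composed LoRA updates (e.g. a hypothetical task adapter plus a language adapter), with the rank, dimensions, and adapter names all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2  # hidden size and LoRA rank (illustrative values)

# Frozen base weight of one linear layer.
W = rng.standard_normal((d, d))

def lora_delta(A, B, alpha=1.0):
    """Low-rank LoRA update: (alpha / r) * B @ A, with A (r x d) and B (d x r)."""
    return (alpha / A.shape[0]) * (B @ A)

# Two hypothetical adapters; B is zero-initialised as in standard LoRA,
# so training starts from the unmodified base model.
A_task, B_task = rng.standard_normal((r, d)), np.zeros((d, r))
A_lang, B_lang = rng.standard_normal((r, d)), np.zeros((d, r))

# Composition: apply both low-rank updates on top of the frozen base.
W_composed = W + lora_delta(A_task, B_task) + lora_delta(A_lang, B_lang)
```

Because only the small `A`/`B` factors are trained, swapping or summing adapters is cheap, which is what makes this style of composition attractive for transferring knowledge across languages.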

📝 Abstract
Embedding models are crucial to modern NLP. However, the creation of the most effective models relies on carefully constructed supervised finetuning data. For high-resource languages, such as English, such datasets are readily available; for hundreds of other languages, they are simply non-existent. We investigate whether the advent of large language models can help to bridge this gap. We test three different strategies for generating the synthetic triplet data used to optimise embedding models: in-context learning, as well as two novel approaches leveraging adapter composition and cross-lingual finetuning of the LLM generator (XL-LoRA), respectively. We find that while in-context learning still falls short of strong non-synthetic baselines, adapter composition and XL-LoRA yield strong performance gains across a wide array of tasks and languages, offering a clear, scalable pathway to producing performant embedding models for a wide variety of languages.
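The triplets mentioned in the abstract are (query, positive, negative) examples used to finetune the embedding model. As a rough sketch of how such data is consumed, the toy function below computes a standard margin-based triplet loss over cosine similarities; the embeddings, margin value, and function names are all illustrative, not taken from the paper.

```python
import numpy as np

def l2_normalize(x):
    """Normalise each row to unit length so dot products become cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def triplet_loss(q, pos, neg, margin=0.2):
    """Margin loss: push sim(q, pos) above sim(q, neg) by at least `margin`."""
    q, pos, neg = map(l2_normalize, (q, pos, neg))
    sim_pos = np.sum(q * pos, axis=-1)
    sim_neg = np.sum(q * neg, axis=-1)
    return np.maximum(0.0, margin - sim_pos + sim_neg).mean()

# One toy synthetic triplet: the positive is near the query, the negative is not.
q   = np.array([[1.0, 0.0]])
pos = np.array([[0.9, 0.1]])
neg = np.array([[0.0, 1.0]])

loss = triplet_loss(q, pos, neg)
```

In the setting the paper describes, an LLM would generate the query/positive/negative texts, an encoder would embed them, and a loss of this general shape would drive the finetuning.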
Problem

Research questions and friction points this paper is trying to address.

low-resource languages
embedding models
supervised finetuning data
synthetic data
cross-lingual
Innovation

Methods, ideas, or system contributions that make the work stand out.

low-resource languages
embedding models
synthetic data generation
adapter composition
cross-lingual finetuning
Merve Basoz
School of Informatics, University of Edinburgh, UK

Andrew Horne
Edina, University of Edinburgh, UK

Mattia Opper
PhD Student, ILCC, Edinburgh University
Machine Learning · NLP · Graph ML · Structure Induction · Representation Learning