Retrofitting Small Multilingual Models for Retrieval: Matching 7B Performance with 300M Parameters

📅 2025-10-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Small multilingual models significantly underperform large models in cross-lingual retrieval. Method: We systematically investigate the impact of training data scale, negative sampling strategies, and task diversity—finding task diversity more critical than language diversity—and introduce a hard negative mining mechanism. Leveraging large-scale multilingual retrieval data, we design diverse retrieval tasks and apply efficient fine-tuning to train a compact 300M-parameter model. Contribution/Results: Our model achieves performance on par with or exceeding that of state-of-the-art 7B-parameter models across standard multilingual retrieval benchmarks. This marks the first instance where a small-scale model substantively closes the retrieval capability gap with much larger counterparts, establishing a new paradigm for efficient, resource-conscious multilingual retrieval.

Technology Category

Application Category

📝 Abstract

Training effective multilingual embedding models presents unique challenges due to the diversity of languages and task objectives. Although small multilingual models (<1 B parameters) perform well on multilingual tasks generally, they consistently lag behind larger models (>1 B) in the most prevalent use case: retrieval. This raises a critical question: Can smaller models be retrofitted specifically for retrieval tasks to enhance their performance? In this work, we investigate key factors that influence the effectiveness of multilingual embeddings, focusing on training data scale, negative sampling strategies, and data diversity. We find that while increasing the scale of training data yields initial performance gains, these improvements quickly plateau - indicating diminishing returns. Incorporating hard negatives proves essential for consistently improving retrieval accuracy. Furthermore, our analysis reveals that task diversity in the training data contributes more significantly to performance than language diversity alone. As a result, we develop a compact (approximately 300M) multilingual model that achieves retrieval performance comparable to or even surpassing current strong 7B models.

Problem

Research questions and friction points this paper is trying to address.

Enhancing small multilingual models for retrieval tasks

Overcoming performance gap between small and large models

Optimizing training strategies for multilingual embedding effectiveness

Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrofitting small multilingual models for retrieval

Using hard negatives to improve retrieval accuracy

Emphasizing task diversity over language diversity

🔎 Similar Papers

No similar papers found.

Authors to Follow