🤖 AI Summary
The core challenge in multilingual dense retrieval lies in cross-lingual semantic alignment, yet existing approaches rely predominantly on complex model architectures while overlooking the critical role of training data quality in representation learning. This paper proposes a data-driven optimization paradigm focused on improving the quality of hard negative mining and the efficiency of mini-batch construction. Specifically, we enhance the contrastive learning framework with a language-aware hard negative mining strategy and introduce lightweight cross-lingual data augmentation. Crucially, our method requires no architectural modifications to the underlying retriever. Evaluated on the 16-language MIRACL benchmark, it significantly outperforms multiple strong baselines. The results demonstrate that high-quality training data, not just model capacity, plays a decisive role in achieving effective multilingual representation alignment. Our work provides an efficient and scalable pathway for multilingual dense retrieval, emphasizing data-centric optimization over model-centric complexity.
📝 Abstract
Multilingual dense retrieval aims to retrieve relevant documents across different languages with a unified retriever model. The challenge lies in aligning representations of different languages in a shared vector space. The common practice is to fine-tune the dense retriever via contrastive learning, whose effectiveness relies heavily on the quality of negative samples and the composition of mini-batch data. Unlike existing studies that focus on developing sophisticated model architectures, we propose a method that boosts data utilization for multilingual dense retrieval by obtaining high-quality hard negative samples and effective mini-batch data. Extensive experiments on MIRACL, a multilingual retrieval benchmark covering 16 languages, demonstrate the effectiveness of our method, which outperforms several strong existing baselines.
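The contrastive fine-tuning setup the abstract describes, where the retriever is trained against in-batch negatives plus mined hard negatives, is commonly formulated as an InfoNCE loss. The sketch below is a minimal NumPy illustration of that standard formulation, not the paper's actual implementation; the function name, embedding shapes, and temperature value are assumptions for the example.

```python
import numpy as np

def info_nce_loss(q, pos, hard_neg, temperature=0.05):
    """InfoNCE loss with in-batch negatives plus one mined hard
    negative per query.

    q, pos, hard_neg: (B, D) arrays of L2-normalized embeddings for
    queries, their positive passages, and their hard negatives.
    Returns the mean negative log-likelihood of each true pair.
    """
    # Query-to-positive similarities: (B, B); diagonal holds true pairs,
    # off-diagonal entries act as in-batch negatives.
    sim_batch = q @ pos.T / temperature
    # Each query's similarity to its own mined hard negative: (B, 1).
    sim_hard = np.sum(q * hard_neg, axis=1, keepdims=True) / temperature
    logits = np.concatenate([sim_batch, sim_hard], axis=1)  # (B, B+1)
    # Numerically stable log-softmax; the positive for query i is column i.
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(q))
    return -log_probs[idx, idx].mean()
```

Under this formulation, better hard negatives (the `hard_neg` column) and better-composed mini-batches (the off-diagonal columns of `sim_batch`) both sharpen the training signal without touching the retriever architecture, which is the data-centric lever the paper targets.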