🤖 AI Summary
The core challenge in multilingual dense retrieval lies in cross-lingual semantic alignment, yet existing approaches rely predominantly on complex model architectures while overlooking the critical role of training data quality in representation learning. This paper proposes a data-driven optimization paradigm focused on improving the quality of hard negative mining and the efficiency of mini-batch construction. Specifically, we enhance the contrastive learning framework with a language-aware hard negative mining strategy and introduce lightweight cross-lingual data augmentation. Crucially, our method requires no architectural modifications to the underlying retriever. Evaluated on the 16-language MIRACL benchmark, it significantly outperforms multiple strong baselines. The results demonstrate that high-quality training data, not just model capacity, plays a decisive role in achieving effective multilingual representation alignment. Our work provides an efficient and scalable pathway for multilingual dense retrieval, emphasizing data-centric optimization over model-centric complexity.
📝 Abstract
Multilingual dense retrieval aims to retrieve relevant documents across different languages with a unified retriever model. The challenge lies in aligning representations of different languages in a shared vector space. The common practice is to fine-tune the dense retriever via contrastive learning, whose effectiveness relies heavily on the quality of negative samples and the composition of mini-batch data. Unlike existing studies that focus on developing sophisticated model architectures, we propose a method that boosts data utilization for multilingual dense retrieval by obtaining high-quality hard negative samples and effective mini-batch data. Extensive experiments on MIRACL, a multilingual retrieval benchmark covering 16 languages, demonstrate the effectiveness of our method, which outperforms several strong existing baselines.
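The contrastive fine-tuning setup the abstract describes, where the retriever is trained against in-batch negatives plus mined hard negatives, is commonly formulated as an InfoNCE loss. The sketch below is a minimal NumPy illustration of that standard formulation, not the paper's actual implementation; the function name, embedding shapes, and temperature value are assumptions for the example.

```python
import numpy as np

def info_nce_loss(q, pos, hard_neg, temperature=0.05):
    """InfoNCE loss with in-batch negatives plus one mined hard
    negative per query.

    q, pos, hard_neg: (B, D) arrays of L2-normalized embeddings for
    queries, their positive passages, and their hard negatives.
    Returns the mean negative log-likelihood of each true pair.
    """
    # Query-to-positive similarities: (B, B); diagonal holds true pairs,
    # off-diagonal entries act as in-batch negatives.
    sim_batch = q @ pos.T / temperature
    # Each query's similarity to its own mined hard negative: (B, 1).
    sim_hard = np.sum(q * hard_neg, axis=1, keepdims=True) / temperature
    logits = np.concatenate([sim_batch, sim_hard], axis=1)  # (B, B+1)
    # Numerically stable log-softmax; the positive for query i is column i.
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(q))
    return -log_probs[idx, idx].mean()
```

Under this formulation, better hard negatives (the `hard_neg` column) and better-composed mini-batches (the off-diagonal columns of `sim_batch`) both sharpen the training signal without touching the retriever architecture, which is the data-centric lever the paper targets.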