Boosting Data Utilization for Multilingual Dense Retrieval

📅 2025-09-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
The core challenge in multilingual dense retrieval lies in cross-lingual semantic alignment, yet existing approaches predominantly rely on complex model architectures while overlooking the critical role of training data quality in representation learning. This paper proposes a data-driven optimization paradigm that focuses on improving the quality of hard negative mining and the efficiency of mini-batch construction. Specifically, we enhance the contrastive learning framework with a language-aware hard negative mining strategy and introduce lightweight cross-lingual data augmentation. Crucially, our method requires no architectural modifications to the underlying retriever. Evaluated on the 16-language MIRACL benchmark, it significantly outperforms multiple strong baselines. Results demonstrate that high-quality training data, not just model capacity, plays a decisive role in achieving effective multilingual representation alignment. Our work provides a novel, efficient, and scalable pathway for multilingual dense retrieval, emphasizing data-centric optimization over model-centric complexity.
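The "language-aware hard negative mining" mentioned above can be pictured concretely. Below is a minimal sketch under assumptions of our own rather than the paper's exact procedure: the helper `retrieve_topk`, the candidate pool size, and the same-language preference are all illustrative.

```python
# Hypothetical sketch of language-aware hard negative mining.
# Assumption (not from the paper): `retrieve_topk` is any baseline
# retriever that returns (doc_id, language) pairs ranked by score.

def mine_hard_negatives(query, query_lang, positives, retrieve_topk,
                        k=100, n_neg=7):
    """Select hard negatives, preferring the query's own language."""
    candidates = retrieve_topk(query, k)            # [(doc_id, lang), ...]
    pool = [(d, lang) for d, lang in candidates
            if d not in positives]                  # drop known answers
    # Highly ranked same-language non-answers tend to be the hardest,
    # most informative negatives for cross-lingual alignment.
    same = [d for d, lang in pool if lang == query_lang]
    other = [d for d, lang in pool if lang != query_lang]
    return (same + other)[:n_neg]
```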


📝 Abstract
Multilingual dense retrieval aims to retrieve relevant documents across different languages with a unified retriever model. The challenge lies in aligning the representations of different languages in a shared vector space. The common practice is to fine-tune the dense retriever via contrastive learning, whose effectiveness relies heavily on the quality of negative samples and the efficacy of mini-batch data. Unlike existing studies that focus on developing sophisticated model architectures, we propose a method that boosts data utilization for multilingual dense retrieval by obtaining high-quality hard negative samples and effective mini-batch data. Extensive experimental results on MIRACL, a multilingual retrieval benchmark covering 16 languages, demonstrate the effectiveness of our method, which outperforms several strong baselines.
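For context, the contrastive fine-tuning the abstract refers to is commonly an InfoNCE-style objective over one positive passage, the mined hard negatives, and the other positives in the mini-batch. A minimal PyTorch sketch, with tensor shapes and the temperature chosen by us rather than taken from the paper:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q, p, hard_neg, temperature=0.05):
    """InfoNCE over in-batch negatives plus mined hard negatives.

    q:        (B, D)    query embeddings
    p:        (B, D)    positive passage embeddings
    hard_neg: (B, H, D) mined hard negatives per query
    """
    B = q.size(0)
    # In-batch scores: every other query's positive acts as a negative,
    # which is why the composition of the mini-batch matters so much.
    scores_pos = q @ p.T                                   # (B, B)
    scores_hard = torch.einsum("bd,bhd->bh", q, hard_neg)  # (B, H)
    logits = torch.cat([scores_pos, scores_hard], dim=1) / temperature
    labels = torch.arange(B, device=q.device)  # diagonal entries are positives
    return F.cross_entropy(logits, labels)
```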
Problem

Research questions and friction points this paper is trying to address.

Aligning multilingual representations in a shared vector space
Improving negative sample quality for contrastive learning
Enhancing mini-batch data efficacy in dense retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

High-quality hard negative sample generation
Effective mini-batch data utilization (a batching sketch follows this list)
Multilingual dense retrieval enhancement
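The abstract leaves open how "effective mini-batch data" is assembled. One plausible reading, sketched below entirely under our own assumptions, is to keep each batch monolingual so that in-batch negatives cannot be separated from the positive by language cues alone:

```python
import random
from collections import defaultdict

def language_grouped_batches(examples, batch_size, seed=0):
    """Group training examples into same-language mini-batches.

    `examples` is a list of dicts with at least a "lang" key. This is a
    hypothetical strategy: a monolingual batch keeps in-batch negatives
    informative, since language identity alone gives the model no shortcut.
    """
    rng = random.Random(seed)
    by_lang = defaultdict(list)
    for ex in examples:
        by_lang[ex["lang"]].append(ex)
    batches = []
    for group in by_lang.values():
        rng.shuffle(group)
        for i in range(0, len(group), batch_size):
            batches.append(group[i:i + batch_size])
    rng.shuffle(batches)  # interleave languages across training steps
    return batches
```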
👥 Authors
Chao Huang
Key Laboratory of Big Data & Artificial Intelligence in Transportation (Beijing Jiaotong University), Ministry of Education; School of Computer Science and Technology, Beijing Jiaotong University
Fengran Mo
Ph.D. Student, Université de Montréal
Conversational AI, Information Retrieval, Natural Language Processing, Multilingualism
Yufeng Chen
Key Laboratory of Big Data & Artificial Intelligence in Transportation (Beijing Jiaotong University), Ministry of Education; School of Computer Science and Technology, Beijing Jiaotong University
Changhao Guan
Beijing Jiaotong University
Natural Language Processing, Large Language Models
Zhenrui Yue
University of Illinois Urbana-Champaign
Large Language Models, Information Retrieval, Recommender Systems
Xinyu Wang
McGill University
Jinan Xu
Professor, School of Computer and Information Technology, Beijing Jiaotong University
NLP, Machine Translation, LLM
Kaiyu Huang
Key Laboratory of Big Data & Artificial Intelligence in Transportation (Beijing Jiaotong University), Ministry of Education; School of Computer Science and Technology, Beijing Jiaotong University