🤖 AI Summary
This work addresses the limitations of binary-label fine-tuning in dense retrieval models, which overlooks the inherently graded nature of relevance and suffers from significant performance sensitivity to threshold selection in multilingual settings. The study systematically investigates the impact of thresholding strategies when converting graded relevance judgments into binary labels for multilingual dense retrieval. It proposes integrating threshold calibration directly into the fine-tuning process, leveraging a large language model-generated multilingual graded relevance dataset and contrastive learning. Experiments across monolingual, multilingual mixed, and cross-lingual retrieval scenarios demonstrate that appropriate threshold selection substantially improves retrieval effectiveness, reduces the required amount of labeled data, and mitigates the adverse effects of annotation noise.
📝 Abstract
Dense retrieval models are typically fine-tuned with contrastive learning objectives that require binary relevance judgments, even though relevance is inherently graded. We analyze how graded relevance scores and the threshold used to convert them into binary labels affect multilingual dense retrieval. Using a multilingual dataset with LLM-annotated relevance scores, we examine monolingual, multilingual mixture, and cross-lingual retrieval scenarios. Our findings show that the optimal threshold varies systematically across languages and tasks, often reflecting differences in resource level. A well-chosen threshold can improve effectiveness, reduce the amount of fine-tuning data required, and mitigate annotation noise, whereas a poorly chosen one can degrade performance. We argue that graded relevance is a valuable but underutilized signal for dense retrieval, and that threshold calibration should be treated as a principled component of the fine-tuning pipeline.
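To make the thresholding step concrete, here is a minimal sketch of converting graded relevance judgments into binary labels for contrastive fine-tuning. All names, the 0–3 grading scale, and the toy judgments are illustrative assumptions, not details taken from the paper:

```python
# Illustrative sketch (not the paper's implementation): binarizing graded
# relevance scores under an adjustable threshold, then partitioning a
# query's candidate documents into positives and negatives for
# contrastive learning. Scores here use a hypothetical 0-3 scale.

def binarize(graded, threshold):
    """Map graded relevance scores to binary labels: 1 if score >= threshold."""
    return {doc: int(score >= threshold) for doc, score in graded.items()}

def split_pos_neg(graded, threshold):
    """Partition one query's candidates into positive/negative lists."""
    labels = binarize(graded, threshold)
    positives = [d for d, y in labels.items() if y == 1]
    negatives = [d for d, y in labels.items() if y == 0]
    return positives, negatives

# Toy graded judgments for a single query
graded = {"d1": 3, "d2": 2, "d3": 1, "d4": 0}

# A strict threshold yields fewer, cleaner positives
pos_strict, neg_strict = split_pos_neg(graded, threshold=3)

# A loose threshold admits partially relevant documents as positives,
# which can add useful signal or annotation noise depending on the task
pos_loose, neg_loose = split_pos_neg(graded, threshold=1)
```

The choice of `threshold` is exactly the calibration knob the paper studies: moving it trades off positive-pair purity against the amount of usable training signal, and the best setting can differ by language and retrieval scenario.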