🤖 AI Summary
This study addresses cross-lingual retrieval of previously fact-checked claims, which is particularly important for low-resource languages and for global narratives such as pandemics, conflicts, or international politics. Recognizing that cross-lingual retrieval differs fundamentally from multilingual retrieval, the authors propose two strategies: (1) a sentence-similarity-based negative sampling method for supervised fine-tuning; and (2) LLM-based re-ranking for the unsupervised setting. Evaluated on a benchmark spanning 47 languages and 283 language pairs, LLM-based re-ranking yields the best overall results, followed by fine-tuning with negatives sampled by sentence similarity. Crucially, the results indicate that cross-lingual retrieval has its own unique characteristics and warrants dedicated modeling rather than mere adaptation of multilingual approaches, advancing both practical applicability and theoretical grounding in cross-lingual information access for fact-checking in resource-scarce languages.
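The negative sampling idea above can be illustrated with a minimal sketch: instead of pairing each social-media post with random non-matching claims, one selects the non-matching claims whose embeddings are *most similar* to the post, which yields harder and more informative negatives for contrastive fine-tuning. The function below is an illustrative assumption about how such sampling could work, not the paper's exact implementation; embeddings, candidate sets, and `k` are placeholders.

```python
import numpy as np

def sample_hard_negatives(query_emb, cand_embs, positive_ids, k=2):
    """Return indices of the k candidate claims most similar to the
    query that are NOT known positives (so-called hard negatives)."""
    # Cosine similarity between the query and every candidate claim.
    q = query_emb / np.linalg.norm(query_emb)
    c = cand_embs / np.linalg.norm(cand_embs, axis=1, keepdims=True)
    sims = c @ q
    # Rank candidates by similarity, descending, skipping positives.
    order = np.argsort(-sims)
    negatives = [int(i) for i in order if i not in positive_ids]
    return negatives[:k]

# Toy example: candidate 0 is the true match; 1 is a near-miss that
# makes a good hard negative; 2 and 3 are easier negatives.
query = np.array([1.0, 0.0])
cands = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]])
print(sample_hard_negatives(query, cands, positive_ids={0}))  # [1, 3]
```

In practice the embeddings would come from a multilingual sentence encoder so that similarity is meaningful across languages.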
📝 Abstract
Retrieval of previously fact-checked claims is a well-established task, whose automation can assist professional fact-checkers in the initial steps of information verification. Previous works have mostly tackled the task monolingually, i.e., with both the input and the retrieved claims in the same language. However, especially for languages with limited availability of fact-checks and in the case of global narratives, such as pandemics, wars, or international politics, it is crucial to be able to retrieve claims across languages. In this work, we examine strategies to improve multilingual and cross-lingual performance, namely selection of negative examples (in the supervised setting) and re-ranking (in the unsupervised setting). We evaluate all approaches on a dataset containing posts and claims in 47 languages (283 language combinations). We observe that the best results are obtained by LLM-based re-ranking, followed by fine-tuning with negative examples sampled using a sentence-similarity-based strategy. Most importantly, we show that cross-linguality is a setup with its own unique characteristics compared to the multilingual setup.
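The unsupervised re-ranking step can be sketched as follows: a first-stage retriever returns candidate fact-checked claims, and an LLM then scores each (post, claim) pair for relevance, with the final ranking following those scores. The prompt, model, and scoring interface below are assumptions for illustration; the stub `overlap_score` merely stands in for an actual LLM call.

```python
def rerank(post, candidates, score_fn, top_n=5):
    """Re-rank retrieved fact-check candidates by a relevance score.

    `score_fn(post, claim)` is a placeholder for an LLM judgment,
    e.g. the probability the model assigns to "yes" when asked
    whether the claim matches the post (exact prompt is assumed).
    """
    scored = [(score_fn(post, claim), claim) for claim in candidates]
    scored.sort(key=lambda sc: sc[0], reverse=True)
    return [claim for _, claim in scored[:top_n]]

# Toy stub standing in for the LLM: score by simple word overlap.
def overlap_score(post, claim):
    p, c = set(post.lower().split()), set(claim.lower().split())
    return len(p & c) / max(len(c), 1)

candidates = ["the vaccine causes illness", "earth is flat"]
print(rerank("vaccine causes illness", candidates, overlap_score, top_n=1))
# ['the vaccine causes illness']
```

Because the LLM only sees a small candidate list rather than the whole collection, this two-stage design keeps inference cost manageable while letting the stronger model make the final cross-lingual relevance call.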