Claim2Vec: Embedding Fact-Check Claims for Multilingual Similarity and Clustering

📅 2026-04-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of representing and clustering fact-check claims that are semantically similar but differ in wording and language. We propose Claim2Vec, the first embedding model designed specifically for multilingual fact-check claims. It fine-tunes a multilingual pretrained encoder with contrastive learning on pairs of similar claims, producing a semantic vector space suited to cross-lingual similarity computation and clustering. In experiments on three datasets, covering 14 multilingual embedding models and 7 clustering algorithms, Claim2Vec significantly improves clustering performance, with gains in both cluster label alignment and the geometric structure of the embedding space. Clusters that contain multiple languages also benefit from fine-tuning, which indicates cross-lingual knowledge transfer.

📝 Abstract

Recurrent claims present a major challenge for automated fact-checking systems designed to combat misinformation, especially in multilingual settings. While tasks such as claim matching and fact-checked claim retrieval aim to address this problem by linking claim pairs, the broader challenge of effectively representing groups of similar claims that can be resolved with the same fact-check via claim clustering remains relatively underexplored. To address this gap, we introduce Claim2Vec, the first multilingual embedding model optimized to represent fact-check claims as vectors in an improved semantic embedding space. We fine-tune a multilingual encoder using contrastive learning with similar multilingual claim pairs. Experiments on the claim clustering task using three datasets, 14 multilingual embedding models, and 7 clustering algorithms demonstrate that Claim2Vec significantly improves clustering performance. Specifically, it enhances both cluster label alignment and the geometric structure of the embedding space across different cluster configurations. Our multilingual analysis shows that clusters containing multiple languages benefit from fine-tuning, demonstrating cross-lingual knowledge transfer.
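The abstract reports gains on two fronts: cluster label alignment and the geometric structure of the embedding space. It does not name the metrics, so the following is only an illustrative sketch that assumes purity as a stand-in for label alignment and the mean silhouette coefficient under cosine distance as a stand-in for geometric structure. The function names and the toy vectors are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def cosine_distance(u, v):
    """1 minus the cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def purity(predicted, gold):
    """Share of items that carry the majority gold label of their predicted cluster."""
    members = defaultdict(list)
    for pred, true in zip(predicted, gold):
        members[pred].append(true)
    majority = sum(Counter(labels).most_common(1)[0][1] for labels in members.values())
    return majority / len(gold)


def mean_silhouette(vectors, labels):
    """Mean silhouette coefficient under cosine distance (assumed metric)."""
    scores = []
    for i, (vec, label) in enumerate(zip(vectors, labels)):
        # Collect distances from item i to every other item, grouped by cluster.
        by_cluster = defaultdict(list)
        for j, (other_vec, other_label) in enumerate(zip(vectors, labels)):
            if j != i:
                by_cluster[other_label].append(cosine_distance(vec, other_vec))
        own = by_cluster.pop(label, [])
        if not own or not by_cluster:
            scores.append(0.0)  # singleton cluster or only one cluster: score 0
            continue
        a = sum(own) / len(own)  # mean distance within the item's own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other cluster
        scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(scores) / len(scores)


# Toy embeddings: two claims near one axis, two claims near the other.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
gold_labels = [0, 0, 1, 1]
mixed_labels = [0, 1, 0, 1]
```

On the toy data, the gold assignment scores higher than the mixed assignment on both measures, which is the kind of joint improvement the abstract describes for embeddings that separate claim groups well.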

Problem

Research questions and friction points this paper is trying to address.

fact-checking

claim clustering

multilingual

misinformation

semantic representation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Claim2Vec

multilingual embedding

contrastive learning
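To make the contrastive learning idea concrete, here is a minimal sketch of one common contrastive objective, an InfoNCE-style loss with in-batch negatives. The page does not say which loss, temperature, or encoder Claim2Vec uses, so the loss choice, the `temperature` default, and the function names below are assumptions for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchors, positives, temperature=0.05):
    """Mean InfoNCE loss with in-batch negatives (assumed objective).

    anchors[i] and positives[i] are embeddings of two similar claims,
    possibly in different languages. For anchors[i], every positives[j]
    with j != i acts as a negative.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


# Toy batch: each anchor matches its own positive in the first case only.
toy_anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce_loss(toy_anchors, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = info_nce_loss(toy_anchors, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each claim sits closest to its paired claim and large when pairs are mismatched. Minimizing it during fine-tuning pulls similar claims together in the embedding space regardless of language.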