🤖 AI Summary
This paper addresses multilingual social media claim normalization, a critical preprocessing task for fact-checking. We propose a lightweight, retrieval-first, large language model (LLM)-assisted framework that dynamically combines in-context learning with GPT-4o-mini and nearest-neighbor retrieval from the training set. By leveraging few-shot prompting and semantic similarity matching, it efficiently maps noisy, multilingual claims to standardized canonical forms, thereby improving downstream veracity classification. Our key contributions are threefold: (1) the first integration of retrieval-augmented generation with lightweight LLMs for cross-lingual claim normalization; (2) top-ranked performance in the monolingual track, including first place in 7 of the 13 languages; and (3) empirical evidence that data-aware prompting substantially improves robustness for low-resource languages. However, zero-shot generalization remains limited, underscoring a persistent dependence on language coverage and training data distribution.
📝 Abstract
Claim normalization is an integral part of any automated fact-verification system. It converts typically noisy claim data, such as social media posts, into normalized claims, which are then fed into downstream veracity classification tasks. The CheckThat! 2025 Task 2 focuses specifically on claim normalization and spans 20 languages under monolingual and zero-shot conditions. Our proposed solution is a lightweight *retrieval-first, LLM-backed* pipeline that either dynamically prompts GPT-4o-mini with in-context examples or retrieves the closest normalization directly from the training set. On the official test set, the system ranks near the top for most monolingual tracks, achieving first place in 7 of the 13 languages. In contrast, the system underperforms in the zero-shot setting, highlighting the limitations of the proposed solution.
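The retrieval-first decision rule described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the similarity function (token-level Jaccard rather than a learned semantic embedding), the `threshold` value, the `k` setting, and the prompt format are all assumptions made for the example.

```python
# Hypothetical sketch of a retrieval-first, LLM-backed normalization pipeline.
# Similarity metric, threshold, and prompt template are illustrative assumptions.

def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity standing in for a semantic similarity model."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def normalize_claim(post, train_pairs, k=3, threshold=0.8):
    """If a near-duplicate post exists in the training set, return its stored
    normalization directly (retrieval-first, no LLM call). Otherwise build a
    few-shot prompt from the k nearest neighbors for an LLM such as GPT-4o-mini."""
    scored = sorted(train_pairs, key=lambda p: jaccard(post, p[0]), reverse=True)
    best_post, best_norm = scored[0]
    if jaccard(post, best_post) >= threshold:
        return ("retrieved", best_norm)  # close match: reuse training normalization
    examples = "\n".join(f"Post: {p}\nClaim: {c}" for p, c in scored[:k])
    prompt = f"{examples}\nPost: {post}\nClaim:"
    return ("prompt", prompt)  # caller sends this prompt to the LLM
```

A near-duplicate of a training post short-circuits to the stored normalization, while an unseen post falls through to in-context prompting; this keeps LLM usage light, which matches the paper's emphasis on a lightweight pipeline.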