When Scale Meets Diversity: Evaluating Language Models on Fine-Grained Multilingual Claim Verification

📅 2025-07-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically evaluates language models on fine-grained multilingual claim verification across 25 languages and seven granular truthfulness categories. Using the X-Fact benchmark, it compares XLM-R, mT5, Llama 3.1, Qwen 2.5, and Mistral Nemo under both prompt-engineering and fine-tuning paradigms. The lightweight, task-specialized XLM-R (270M parameters, far smaller than the 7-12B-parameter LLMs) achieves a macro-F1 of 57.7%, substantially outperforming both the best LLM (16.9%) and the prior state of the art (41.9%), a 15.8-point gain over the latter. This finding challenges the "scale-as-capability" assumption, providing empirical evidence that architectural suitability and task specialization can dominate parameter count in multilingual fine-grained fact-checking. The work points toward efficient, deployable fact-checking systems and establishes strong multilingual baselines for future research.

📝 Abstract
The rapid spread of multilingual misinformation requires robust automated fact verification systems capable of handling fine-grained veracity assessments across diverse languages. While large language models have shown remarkable capabilities across many NLP tasks, their effectiveness for multilingual claim verification with nuanced classification schemes remains understudied. We conduct a comprehensive evaluation of five state-of-the-art language models on the X-Fact dataset, which spans 25 languages with seven distinct veracity categories. Our experiments compare small language models (the encoder-based XLM-R and encoder-decoder mT5) with recent decoder-only LLMs (Llama 3.1, Qwen 2.5, Mistral Nemo) using both prompting and fine-tuning approaches. Surprisingly, we find that XLM-R (270M parameters) substantially outperforms all tested LLMs (7-12B parameters), achieving 57.7% macro-F1 compared to the best LLM performance of 16.9%. This represents a 15.8 percentage-point improvement over the previous state of the art (41.9%), establishing new performance benchmarks for multilingual fact verification. Our analysis reveals problematic patterns in LLM behavior, including systematic difficulties in leveraging evidence and pronounced biases toward frequent categories in imbalanced data settings. These findings suggest that for fine-grained multilingual fact verification, smaller specialized models may be more effective than general-purpose large models, with important implications for practical deployment of fact-checking systems.
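The abstract reports macro-F1 and notes that LLMs show "pronounced biases toward frequent categories in imbalanced data settings." Macro-F1 averages per-class F1 scores with equal weight, so it heavily penalizes exactly this failure mode. The sketch below (toy data and three of the seven veracity labels are assumptions for illustration, not from the paper) shows why a majority-class predictor can score high accuracy yet very low macro-F1:

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted average of per-class F1: every class counts equally,
    no matter how rare it is in the data."""
    f1s = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Hypothetical imbalanced toy set using three of the seven labels:
labels = ["true", "mostly_true", "false"]
y_true = ["true"] * 8 + ["mostly_true"] * 1 + ["false"] * 1

# A degenerate model that always predicts the frequent class:
majority = ["true"] * 10
accuracy = sum(t == p for t, p in zip(y_true, majority)) / len(y_true)
# accuracy is 0.8, but macro-F1 is only ~0.30: the two minority
# classes each contribute an F1 of 0, dragging the average down.
```

This is why a model biased toward frequent categories can look reasonable on accuracy while collapsing on the macro-F1 metric the paper uses.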
Problem

Research questions and friction points this paper is trying to address.

Evaluating language models on multilingual claim verification
Assessing fine-grained veracity across diverse languages
Comparing small vs large models for fact-checking effectiveness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates multilingual models on fine-grained verification
Compares small encoder models with large decoder models
XLM-R outperforms larger models in multilingual verification
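The comparison above contrasts the prompting paradigm used for the decoder-only LLMs with fine-tuning. As a rough illustration of the prompting side, here is a hypothetical zero-shot prompt builder; the label names and template wording are assumptions for illustration and are not the paper's actual prompts or X-Fact's exact category names:

```python
# Assumed seven-way label set for illustration only.
LABELS = ["true", "mostly_true", "partly_true", "mostly_false",
          "false", "unverifiable", "other"]

def build_prompt(claim: str, evidence: list[str], language: str) -> str:
    """Assemble a zero-shot veracity-classification prompt that pairs a
    claim with its retrieved evidence snippets (hypothetical template)."""
    ev = "\n".join(f"- {e}" for e in evidence)
    return (
        f"Classify the veracity of the following {language} claim "
        f"using exactly one label from {LABELS}.\n"
        f"Claim: {claim}\n"
        f"Evidence:\n{ev}\n"
        f"Label:"
    )
```

The paper's finding that LLMs struggle to leverage evidence suggests that, under templates like this, the models often ignore the evidence block rather than conditioning their label on it.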