Multilingual Self-Taught Faithfulness Evaluators

📅 2025-07-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the reliance on manual annotation, English-centric bias, and poor cross-lingual generalization in faithfulness evaluation for multilingual large language models (LLMs), this paper proposes the first self-supervised, annotation-free multilingual faithfulness evaluation framework. Methodologically, it constructs a training set from synthetically generated multilingual summaries and combines cross-lingual transfer learning with language-specific and mixed-language fine-tuning, so that multilingual evaluators can be trained entirely on synthetic data. Key contributions include (i) the first empirical validation that an LLM's general linguistic competence directly supports accurate factual-consistency judgments, and (ii) evidence of a link between intrinsic multilingual representations and faithfulness-evaluation performance. Experiments on multilingual benchmarks show that the framework significantly outperforms existing English-dominant evaluators and machine-translation-based baselines, achieving an average 12.7% improvement in faithfulness classification accuracy.
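A minimal sketch of the data-construction step described above, assuming a generic prompt-in/text-out model call. The `Example` record, the prompt wording, and the corruption strategy are illustrative assumptions, not the paper's actual pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Sketch: build an annotation-free training set for a multilingual
# faithfulness evaluator from synthetic summaries. `llm` is any
# text-in/text-out callable; prompts and labels are illustrative.

@dataclass
class Example:
    source: str   # source document, in any language
    summary: str  # candidate summary in the same language
    label: int    # 1 = faithful, 0 = unfaithful
    lang: str     # language code, e.g. "de", "zh"

def build_synthetic_pairs(
    documents: List[Tuple[str, str]],  # (text, language-code) pairs
    llm: Callable[[str], str],
) -> List[Example]:
    examples: List[Example] = []
    for text, lang in documents:
        # Positive example: a plain summary, assumed faithful to the source.
        faithful = llm(f"Summarize the following text in {lang}:\n{text}")
        # Negative example: inject one unsupported detail, producing a
        # controlled hallucination to serve as the unfaithful class.
        unfaithful = llm(
            "Rewrite this summary, adding one detail that does NOT "
            f"appear in the source text. Answer in {lang}.\n"
            f"Source:\n{text}\nSummary:\n{faithful}"
        )
        examples.append(Example(text, faithful, 1, lang))
        examples.append(Example(text, unfaithful, 0, lang))
    return examples
```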

📝 Abstract
The growing use of large language models (LLMs) has increased the need for automatic evaluation systems, particularly to address the challenge of information hallucination. Although existing faithfulness evaluation approaches have shown promise, they are predominantly English-focused and often require expensive human-labeled training data for fine-tuning specialized models. As LLMs see increased adoption in multilingual contexts, there is a need for accurate faithfulness evaluators that can operate across languages without extensive labeled data. This paper presents Self-Taught Evaluators for Multilingual Faithfulness, a framework that learns exclusively from synthetic multilingual summarization data while leveraging cross-lingual transfer learning. Through experiments comparing language-specific and mixed-language fine-tuning approaches, we demonstrate a consistent relationship between an LLM's general language capabilities and its performance in language-specific evaluation tasks. Our framework shows improvements over existing baselines, including state-of-the-art English evaluators and machine translation-based approaches.
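The abstract contrasts language-specific and mixed-language fine-tuning; a minimal sketch of the two regimes follows, reusing the illustrative `Example` records from the sketch above. `fine_tune` is a hypothetical stand-in for any supervised fine-tuning call, not an API from the paper.

```python
from collections import defaultdict
from typing import Dict, List

def split_by_language(examples: List["Example"]) -> Dict[str, List["Example"]]:
    """Group synthetic examples by language for language-specific runs."""
    per_lang: Dict[str, List["Example"]] = defaultdict(list)
    for ex in examples:
        per_lang[ex.lang].append(ex)
    return dict(per_lang)

# Language-specific regime: one evaluator per language.
#   for lang, subset in split_by_language(examples).items():
#       evaluators[lang] = fine_tune(base_model, subset)
#
# Mixed-language regime: a single evaluator trained on the pooled set,
# relying on cross-lingual transfer between languages.
#   evaluator = fine_tune(base_model, examples)
```
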
Problem

Research questions and friction points this paper is trying to address.

Addressing information hallucination in multilingual LLM outputs
Developing faithfulness evaluators without extensive labeled data
Enhancing cross-lingual transfer learning for evaluation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses synthetic multilingual summarization data
Leverages cross-lingual transfer learning
Improves over translation-based evaluation approaches (contrasted in the sketch below)
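To illustrate the last point, here is a hedged sketch contrasting a translation-based baseline (translate everything into English, then apply an English evaluator) with direct in-language scoring. `translate`, `en_evaluator`, and `multilingual_evaluator` are hypothetical callables, not APIs from the paper.

```python
from typing import Callable

def translate_then_evaluate(
    source: str,
    summary: str,
    translate: Callable[[str], str],           # any-language -> English MT
    en_evaluator: Callable[[str, str], float], # English-only faithfulness score
) -> float:
    # Baseline: route both texts through MT, then score in English.
    # Translation errors can mask real faithfulness errors or add new ones.
    return en_evaluator(translate(source), translate(summary))

def evaluate_directly(
    source: str,
    summary: str,
    multilingual_evaluator: Callable[[str, str], float],
) -> float:
    # The approach favored here: score in the original language,
    # avoiding the machine-translation error channel entirely.
    return multilingual_evaluator(source, summary)
```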