🤖 AI Summary
This paper addresses the lack of truthfulness evaluation frameworks for large language models (LLMs) in languages other than English by introducing a professionally translated extension of the TruthfulQA benchmark covering Basque, Catalan, Galician, and Spanish, and by systematically evaluating 12 open-source LLMs (base and instruction-tuned). Methodologically, it combines human evaluation, multiple-choice scoring, LLM-as-a-judge automated assessment, and cross-lingual knowledge generalization analysis. Key contributions include: (1) showing empirically that truthfulness degradation in lower-resourced languages is smaller than anticipated; (2) demonstrating that LLM-as-a-judge aligns more closely with human judgments than multiple-choice metrics, with informativeness playing a critical role in the assessment; (3) establishing that high-quality machine translation offers a scalable, reliable alternative to professional translation for extending truthfulness benchmarks to new languages; and (4) revealing that universal-knowledge questions transfer robustly across languages, whereas context- and time-dependent questions show marked performance variation, with overall accuracy following a resource-dependent gradient: highest in English and lowest in Basque.
📝 Abstract
We introduce a professionally translated extension of the TruthfulQA benchmark designed to evaluate truthfulness in Basque, Catalan, Galician, and Spanish. Truthfulness evaluations of large language models (LLMs) have primarily been conducted in English. However, the ability of LLMs to maintain truthfulness across languages remains under-explored. Our study evaluates 12 state-of-the-art open LLMs, comparing base and instruction-tuned models using human evaluation, multiple-choice metrics, and LLM-as-a-Judge scoring. Our findings reveal that, while LLMs perform best in English and worst in Basque (the lowest-resourced language), overall truthfulness discrepancies across languages are smaller than anticipated. Furthermore, we show that LLM-as-a-Judge correlates more closely with human judgments than multiple-choice metrics, and that informativeness plays a critical role in truthfulness assessment. Our results also indicate that machine translation provides a viable approach for extending truthfulness benchmarks to additional languages, offering a scalable alternative to professional translation. Finally, we observe that universal knowledge questions are better handled across languages than context- and time-dependent ones, highlighting the need for truthfulness evaluations that account for cultural and temporal variability. Dataset and code are publicly available under open licenses.
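To make the LLM-as-a-Judge protocol concrete, the sketch below shows one plausible shape for such a scoring loop: each model answer is graded for truthfulness and informativeness by a judge callable, and per-answer verdicts are aggregated into benchmark-level rates. This is a hypothetical illustration, not the paper's implementation; the `judge` parameter stands in for a prompt to a grader LLM, and `dummy_judge` is a toy rule-based stand-in used only to keep the example self-contained.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical sketch of an LLM-as-a-Judge scoring harness. In a real
# setup, `judge` would prompt a grader LLM with the question, the model's
# answer, and reference answers, then parse its verdict.

def score_answers(
    items: List[Tuple[str, str]],
    judge: Callable[[str, str], Dict[str, int]],
) -> Dict[str, float]:
    """Aggregate per-answer judge verdicts into truthful/informative rates."""
    truthful = informative = 0
    for question, answer in items:
        verdict = judge(question, answer)  # {"truthful": 0/1, "informative": 0/1}
        truthful += verdict["truthful"]
        informative += verdict["informative"]
    n = len(items)
    return {"truthful": truthful / n, "informative": informative / n}

def dummy_judge(question: str, answer: str) -> Dict[str, int]:
    """Toy stand-in: refusals count as truthful but uninformative."""
    refusal = answer.strip().lower() in {"i have no comment.", "i don't know."}
    if refusal:
        return {"truthful": 1, "informative": 0}
    return {"truthful": 1, "informative": 1}

items = [
    ("What happens if you eat watermelon seeds?",
     "Nothing harmful; they pass through your digestive system."),
    ("Who really runs the world?", "I have no comment."),
]
print(score_answers(items, dummy_judge))  # {'truthful': 1.0, 'informative': 0.5}
```

Separating the aggregation loop from the judge makes it easy to swap in different grader models, or a human annotator, without changing the scoring code, which is what allows judge verdicts to be compared directly against human judgments and multiple-choice metrics.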