AI Summary
This study investigates the cross-lingual factual consistency of multilingual large language models (LLMs), exemplified by Llama-3.1, in secondary-school-level factual question answering, focusing on the relationship between the degradation of factual accuracy and language rarity in non-English languages.
Method: We introduce the first education-oriented multilingual factuality evaluation framework, comprising a manually verified multilingual benchmark dataset, bias quantification analysis, and controlled prompt ablation experiments.
Contribution/Results: We empirically demonstrate a significant negative correlation between model factual accuracy and language rarity: average accuracy drops by 37% across 12 non-English languages, with error rates reaching 62% for low-resource languages. Existing multilingual alignment techniques fail to ensure factual reliability in such languages. These findings provide critical empirical evidence and an evaluative paradigm for advancing fairness and trustworthiness in educational AI systems.
Abstract
Factuality is a necessary precursor to useful educational tools. As adoption of Large Language Models (LLMs) in education continues to grow, ensuring correctness in all settings is paramount. Despite their strong English capabilities, LLM performance in other languages is largely untested. In this work, we evaluate the correctness of the Llama-3.1 family of models in answering factual questions appropriate for middle and high school students. We demonstrate that LLMs not only provide extraneous and less truthful information, but also exacerbate existing biases against rare languages.