Facts Do Care About Your Language: Assessing Answer Quality of Multilingual LLMs

๐Ÿ“… 2025-06-03
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿค– AI Summary
This study investigates cross-lingual factual consistency of multilingual large language models (LLMs)โ€”exemplified by Llama-3.1โ€”in secondary-school-level factual question answering, focusing on the relationship between factual accuracy degradation and language rarity in non-English languages. Method: We introduce the first education-oriented multilingual factuality evaluation framework, comprising a manually verified multilingual benchmark dataset, bias quantification analysis, and controlled prompt ablation experiments. Contribution/Results: We empirically demonstrate a significant negative correlation between model factual accuracy and language rarity: average accuracy drops by 37% across 12 non-English languages, with error rates reaching 62% for low-resource languages. Existing multilingual alignment techniques fail to ensure factual reliability in such languages. These findings provide critical empirical evidence and an evaluative paradigm for advancing fairness and trustworthiness in educational AI systems.
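The summary's core quantitative claim is a negative correlation between factual accuracy and language rarity. A minimal sketch of that kind of analysis, using invented example numbers and log speaker count as a stand-in rarity proxy (the paper's actual rarity measure and data are not reproduced here):

```python
# Hypothetical illustration of a rarity-vs-accuracy correlation analysis.
# All language names, speaker counts, and accuracies below are invented.
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# (language, log10 speaker count, factual QA accuracy) -- illustrative only
data = [
    ("English", 9.1, 0.91),
    ("Spanish", 8.7, 0.78),
    ("Swahili", 7.9, 0.55),
    ("Icelandic", 5.6, 0.41),
]
rarity = [-row[1] for row in data]   # fewer speakers => higher rarity
accuracy = [row[2] for row in data]

r = pearson(rarity, accuracy)
print(f"correlation between rarity and accuracy: r = {r:.2f}")
```

With accuracy falling as rarity rises, `r` comes out strongly negative, mirroring the trend the paper reports across its 12 non-English languages.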

๐Ÿ“ Abstract
Factuality is a necessary precursor to useful educational tools. As adoption of Large Language Models (LLMs) in education continues to grow, ensuring correctness in all settings is paramount. Despite their strong English capabilities, LLM performance in other languages remains largely untested. In this work, we evaluate the correctness of the Llama3.1 family of models in answering factual questions appropriate for middle and high school students. We demonstrate that LLMs not only provide extraneous and less truthful information, but also exacerbate existing biases against rare languages.
Problem

Research questions and friction points this paper is trying to address.

Assessing factuality of multilingual LLMs in education
Evaluating LLM correctness in non-English languages
Identifying biases and inaccuracies in rare language outputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluate Llama3.1 models' multilingual factuality
Assess correctness for middle-high school questions
Identify biases and extraneous information issues