AI Summary
This study investigates the cross-lingual factual consistency of multilingual large language models (LLMs), exemplified by Llama-3.1, in secondary-school-level factual question answering, focusing on the relationship between the degradation of factual accuracy and language rarity in non-English languages.
Method: We introduce the first education-oriented multilingual factuality evaluation framework, comprising a manually verified multilingual benchmark dataset, bias quantification analysis, and controlled prompt ablation experiments.
Contribution/Results: We empirically demonstrate a significant negative correlation between model factual accuracy and language rarity: average accuracy drops by 37% across 12 non-English languages, with error rates reaching 62% for low-resource languages. Existing multilingual alignment techniques fail to ensure factual reliability in such languages. These findings provide critical empirical evidence and an evaluative paradigm for advancing fairness and trustworthiness in educational AI systems.
Abstract
Factuality is a necessary precursor to useful educational tools. As adoption of Large Language Models (LLMs) in education continues to grow, ensuring correctness in all settings is paramount. Despite their strong English capabilities, LLM performance in other languages is largely untested. In this work, we evaluate the correctness of the Llama-3.1 family of models in answering factual questions appropriate for middle and high school students. We demonstrate that LLMs not only provide extraneous and less truthful information, but also exacerbate existing biases against rare languages.