Multilingual Performance Biases of Large Language Models in Education

📅 2025-04-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies multilingual performance disparities of large language models (LLMs) in educational applications. We systematically evaluate leading LLMs across English and six non-English languages (Hindi, Arabic, Persian, Telugu, Ukrainian, and Czech) on four core educational tasks: misconception identification, feedback generation, interactive tutoring, and translation scoring. Using a human-annotated multilingual benchmark, we find that model performance strongly correlates with the volume of each language's data in pretraining corpora; low-resource languages exhibit average performance drops of 18–37% relative to English. This constitutes the first cross-lingual, multi-task empirical analysis of LLMs in education. Our key contributions are: (i) a reproducible, task- and language-specific evaluation framework for educational AI; (ii) a diagnosis of critical bottlenecks in multilingual deployment; and (iii) the proposed practice of “pre-deployment language–task alignment validation” to guide equitable, localized AI adoption in education.
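
The summary's central finding, that per-language task scores track how much of each language appears in pretraining data, is at heart a rank-correlation claim. Below is a minimal sketch of how a practitioner could check it on their own numbers; all figures and names are illustrative assumptions, not values reported in the paper, and the sketch assumes scipy is available:

```python
# Rank correlation between a language's share of pretraining data and
# its benchmark score. All figures are illustrative placeholders, NOT
# numbers reported in the paper.
from scipy.stats import spearmanr

data_share = {   # hypothetical fraction of pretraining tokens
    "English": 0.45, "Czech": 0.012, "Ukrainian": 0.009, "Arabic": 0.006,
    "Hindi": 0.006, "Persian": 0.004, "Telugu": 0.001,
}
task_score = {   # hypothetical mean score on one educational task
    "English": 0.88, "Czech": 0.79, "Ukrainian": 0.77, "Arabic": 0.73,
    "Hindi": 0.72, "Persian": 0.68, "Telugu": 0.61,
}

langs = list(data_share)
rho, p = spearmanr([data_share[l] for l in langs],
                   [task_score[l] for l in langs])
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```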

📝 Abstract
Large language models (LLMs) are increasingly being adopted in educational settings. These applications expand beyond English, though current LLMs remain primarily English-centric. In this work, we ascertain whether their use in non-English educational settings is warranted. We evaluated the performance of popular LLMs on four educational tasks (identifying student misconceptions, providing targeted feedback, interactive tutoring, and grading translations) in six languages (Hindi, Arabic, Farsi, Telugu, Ukrainian, Czech) in addition to English. We find that performance on these tasks roughly tracks how much of each language is represented in the training data, with lower-resource languages showing poorer task performance. Although the models perform reasonably well in most languages, the frequent performance drop from English is significant. We therefore recommend that practitioners verify that an LLM works well in the target language for their educational task before deploying it.
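
The abstract's closing recommendation is essentially a deployment gate. Below is a minimal sketch of such a pre-deployment check, assuming a small human-annotated gold sample exists in both English and the target language; the names (`Item`, `validate_language_task`) and the 10% tolerance are hypothetical choices for illustration, not the paper's protocol:

```python
# Minimal pre-deployment language-task validation gate. The model is
# passed in as a callable; the 10% relative-drop tolerance is an
# assumed threshold, not a value from the paper.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Item:
    prompt: str    # e.g. a student answer containing a misconception
    expected: str  # gold label from a human-annotated sample

def accuracy(llm: Callable[[str], str], items: list[Item]) -> float:
    """Fraction of items where the model output matches the gold label."""
    hits = sum(1 for it in items if llm(it.prompt).strip() == it.expected)
    return hits / len(items)

def validate_language_task(
    llm: Callable[[str], str],
    english_items: list[Item],
    target_items: list[Item],
    max_relative_drop: float = 0.10,  # assumed tolerance; tune per task
) -> bool:
    """Approve deployment only if the target-language drop is small."""
    en = accuracy(llm, english_items)
    tgt = accuracy(llm, target_items)
    drop = (en - tgt) / en if en > 0 else 1.0
    print(f"English: {en:.2%}  target: {tgt:.2%}  relative drop: {drop:.2%}")
    return drop <= max_relative_drop
```

The point is not the specific metric but the comparison: measure the same task in English and the target language on equivalently constructed gold samples, and only deploy if the gap is acceptable for the use case.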
Problem

Research questions and friction points this paper is trying to address.

Assess multilingual performance biases in educational LLMs
Evaluate LLM effectiveness in non-English educational tasks
Identify performance gaps between English and low-resource languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluated LLMs in six non-English languages
Assessed performance on four educational tasks
Recommended pre-deployment language-specific validation