Individual and Combined Effects of English as a Second Language and Typos on LLM Performance

📅 2026-04-06
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This study addresses a critical gap in the evaluation of large language models (LLMs) by systematically investigating the non-additive joint effects of non-native English (ESL) input and spelling errors, two common challenges in real-world scenarios that are typically assessed in isolation. Leveraging the Trans-EnV framework, the authors generate eight ESL variants and inject spelling errors at three intensity levels using the MulTypo tool, evaluating model performance across both closed-ended and open-ended tasks. Results reveal that the co-occurrence of ESL characteristics and spelling errors leads to significantly exacerbated performance degradation, particularly in closed-ended tasks. Crucially, assessing either factor alone fails to accurately predict real-world behavior, demonstrating that standard English benchmarks substantially overestimate LLMs' practical capabilities.
๐Ÿ“ Abstract
Large language models (LLMs) are used globally, and because much of their training data is in English, they typically perform best on English inputs. As a result, many non-native English speakers interact with them in English as a second language (ESL), and these inputs often contain typographical errors. Prior work has largely studied the effects of ESL variation and typographical errors separately, even though they often co-occur in real-world use. In this study, we use the Trans-EnV framework to transform standard English inputs into eight ESL variants and apply MulTypo to inject typos at three levels: low, moderate, and severe. We find that combining ESL variation and typos generally leads to larger performance drops than either factor alone, though the combined effect is not simply additive. This pattern is clearest on closed-ended tasks, where performance degradation can be characterized more consistently across ESL variants and typo levels, while results on open-ended tasks are more mixed. Overall, these findings suggest that evaluations on clean standard English may overestimate real-world model performance, and that evaluating ESL variation and typographical errors in isolation does not fully capture model behavior in realistic settings.
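The paper's typo-injection step uses the MulTypo tool, whose API is not shown here. As a rough illustration of the idea of corrupting input at graded intensity levels, the following is a minimal sketch: the function name `inject_typos` and the per-level rates are hypothetical choices, not the paper's actual configuration.

```python
import random

def inject_typos(text: str, rate: float, seed: int = 0) -> str:
    """Corrupt roughly `rate` of the words with one character-level edit
    (adjacent swap, drop, or duplicate) to mimic typographical noise."""
    rng = random.Random(seed)  # seeded for reproducible corruption
    out = []
    for word in text.split():
        # leave very short words alone; corrupt longer ones with probability `rate`
        if len(word) > 3 and rng.random() < rate:
            i = rng.randrange(1, len(word) - 1)
            op = rng.choice(["swap", "drop", "dup"])
            if op == "swap":
                word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
            elif op == "drop":
                word = word[:i] + word[i + 1:]
            else:  # duplicate one character
                word = word[:i] + word[i] + word[i:]
        out.append(word)
    return " ".join(out)

# Three intensity levels, loosely mirroring the paper's low/moderate/severe setup
# (the rates 0.1/0.3/0.6 are illustrative assumptions)
for level, rate in [("low", 0.1), ("moderate", 0.3), ("severe", 0.6)]:
    print(level, "->", inject_typos("please summarise the following paragraph carefully", rate))
```

In an evaluation like the paper's, each benchmark prompt would be passed through the ESL transformation and then a corruptor of this kind before being sent to the model, with clean standard English as the baseline.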
Problem

Research questions and friction points this paper is trying to address.

English as a Second Language
typographical errors
LLM performance
real-world usage
combined effects
Innovation

Methods, ideas, or system contributions that make the work stand out.

ESL variation
typographical errors
combined effects
LLM robustness
real-world evaluation
Serena Liu
Harvard University, Boston, MA 02138, US
Yutong Yang
Mercedes-Benz AG R&D & University of Stuttgart
Computer Vision, Autonomous Driving
Prisha Sheth
Harvard University, Boston, MA 02138, US
Weixuan Dong
Harvard University, Boston, MA 02138, US
Mingjiao Diao
Harvard University, Boston, MA 02138, US
Xinru Zhu
Harvard University, Boston, MA 02138, US
Nikhil Banga
Harvard University, Boston, MA 02138, US
Oscar Melendez
Harvard University, Boston, MA 02138, US
Arnav Sharma
Harvard University, Boston, MA 02138, US
Minda Zhao
Harvard University, Boston, MA 02138, US
Marina Lin
Harvard University, Boston, MA 02138, US
Mengyu Wang
Assistant Professor, Harvard Medical School
Artificial Intelligence, Machine Learning, Ophthalmology, Glaucoma, Computational Mechanics