How well can LLMs Grade Essays in Arabic?

📅 2025-01-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically evaluates generative large language models (LLMs) on Arabic automated essay scoring (AES) using AR-AES, a dataset of authentic student responses. Methodologically, it compares zero-shot, few-shot, and fine-tuning paradigms; introduces a mixed-language prompting strategy (English instructions wrapped around Arabic content) together with prompts augmented by explicit marking guidelines; and analyzes how Arabic tokenization characteristics affect computational cost and scoring consistency. Key contributions: (1) the first empirical benchmark of multiple LLMs (ChatGPT, Llama, Aya, Jais, and ACEGPT) on Arabic AES; (2) evidence that a lightweight BERT-based baseline (QWK 0.88) substantially outperforms the best-performing LLM (ACEGPT, QWK 0.67), underscoring the continued value of task-specific models; and (3) evidence that mixed-language prompting and guideline-aware instruction design improve scoring consistency across courses and assessment formats.
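Quadratic Weighted Kappa (QWK), the metric behind the 0.88 vs. 0.67 comparison above, measures agreement between two raters while penalising large ordinal disagreements more heavily than small ones. A minimal sketch of how it can be computed, assuming scikit-learn and illustrative scores rather than the paper's data:

```python
# Minimal sketch: Quadratic Weighted Kappa (QWK), the agreement metric
# used to compare model scores against human marks.
# The score lists below are illustrative, not the paper's data.
from sklearn.metrics import cohen_kappa_score

human_scores = [3, 1, 4, 2, 5, 3, 0, 4]   # hypothetical gold marks
model_scores = [3, 2, 4, 2, 4, 3, 1, 4]   # hypothetical LLM predictions

# "quadratic" weights penalise a miss of two score bands four times as
# heavily as a miss of one band, which suits ordinal essay marks.
qwk = cohen_kappa_score(human_scores, model_scores, weights="quadratic")
print(f"QWK = {qwk:.2f}")
```

A QWK of 1 indicates perfect agreement and 0 chance-level agreement, so the gap between 0.88 and 0.67 is substantial on this scale.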
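For contribution (2), a minimal sketch of what such a lightweight encoder baseline can look like: an Arabic BERT checkpoint with a one-unit regression head, fine-tuned on (essay, score) pairs. The checkpoint name, sample data, and training step are illustrative assumptions, not the authors' released code.

```python
# Hedged sketch of a BERT-style AES baseline: an Arabic encoder with a
# single-output regression head trained on (essay, score) pairs.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "aubmindlab/bert-base-arabertv2"   # assumed Arabic BERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL, num_labels=1, problem_type="regression")

essays = ["القراءة توسع المدارك وتثري المفردات.", "نص إجابة آخر."]
scores = torch.tensor([[4.0], [2.0]])      # hypothetical human marks

batch = tokenizer(essays, padding=True, truncation=True, return_tensors="pt")
out = model(**batch, labels=scores)        # MSE loss via the regression head
out.loss.backward()                        # one illustrative training step
```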

📝 Abstract
This research assesses the effectiveness of state-of-the-art large language models (LLMs), including ChatGPT, Llama, Aya, Jais, and ACEGPT, on Arabic automated essay scoring (AES) using the AR-AES dataset. It explores zero-shot, few-shot in-context learning, and fine-tuning, and examines the influence of instruction-following capabilities through the inclusion of marking guidelines within the prompts. A mixed-language prompting strategy, integrating English prompts with Arabic content, was implemented to improve model comprehension and performance. Among the models tested, ACEGPT demonstrated the strongest performance across the dataset, achieving a Quadratic Weighted Kappa (QWK) of 0.67, yet was outperformed by a smaller BERT-based model with a QWK of 0.88. The study identifies challenges LLMs face in processing Arabic, including tokenization complexities and higher computational demands. Performance variation across courses underscores the need for adaptive models capable of handling diverse assessment formats and highlights the positive impact of effective prompt engineering on LLM outputs. To the best of our knowledge, this study is the first to empirically evaluate multiple generative LLMs on Arabic essays using authentic student data.
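To make the mixed-language prompting strategy concrete, here is a hypothetical sketch of how such a prompt could be assembled: English instructions and marking guidelines wrap the Arabic question and essay, with optional few-shot examples. The template, score range, and guideline text are assumptions for illustration, not the authors' exact prompts.

```python
# Illustrative sketch of mixed-language prompting: English instructions
# (which many LLMs follow more reliably) wrapped around Arabic content.
def build_prompt(question: str, essay: str, guidelines: str,
                 examples: tuple[tuple[str, int], ...] = ()) -> str:
    """Assemble a zero-shot or few-shot grading prompt."""
    parts = [
        "You are an Arabic essay grader. Score the essay from 0 to 5.",
        f"Marking guidelines:\n{guidelines}",    # instruction-following cue
    ]
    for ex_essay, ex_score in examples:          # few-shot; omit for zero-shot
        parts.append(f"Essay:\n{ex_essay}\nScore: {ex_score}")
    parts.append(f"Question:\n{question}\nEssay:\n{essay}\nScore:")
    return "\n\n".join(parts)

prompt = build_prompt(
    question="ما أهمية القراءة؟",           # "Why is reading important?"
    essay="القراءة توسع المدارك وتثري المفردات.",  # student answer (Arabic)
    guidelines="5 = complete, well-argued answer; 0 = off-topic.",
)
print(prompt)
```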
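The tokenization cost the study highlights can be seen directly by counting subword tokens: vocabularies trained mostly on English split Arabic words into more pieces, so the same content consumes more tokens and more compute. A small illustration, assuming the tiktoken library and the cl100k_base encoding used by ChatGPT-era models; the sample sentences are invented:

```python
# Sketch of the tokenization-cost point: English-heavy subword
# vocabularies fragment Arabic words, inflating sequence length.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
english = "Reading broadens the mind and enriches vocabulary."
arabic = "القراءة توسع المدارك وتثري المفردات."

for label, text in [("English", english), ("Arabic", arabic)]:
    n_tokens = len(enc.encode(text))
    n_words = len(text.split())
    print(f"{label}: {n_words} words -> {n_tokens} tokens "
          f"({n_tokens / n_words:.1f} tokens/word)")
```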
Problem

Research questions and friction points this paper is trying to address.

Arabic Essay Scoring
Large Language Models
Prompting Methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Arabic Essay Grading
Large Language Models Evaluation
Prompting Strategies