Structured Prompting for Arabic Essay Proficiency: A Trait-Centric Evaluation Approach

📅 2026-03-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the limited scalability and insufficient linguistic granularity of existing automated essay scoring tools for Arabic, which struggle to accurately assess multidimensional writing traits such as organization, vocabulary, development, and style. To overcome these limitations, the authors propose a structured prompt-engineering framework that integrates standard prompting, hybrid prompting simulating multi-expert scoring, and rubric-guided strategies incorporating exemplar-based scoring. This approach directs large language models to perform trait-focused evaluation under both zero-shot and few-shot settings. Experimental results on the QAES dataset demonstrate that the proposed method significantly improves scoring consistency, with the Fanar-1-9B-Instruct model achieving the best performance under rubric guidance (QWK = 0.28, CI = 0.41), particularly excelling in discourse-level traits such as development and style.
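The rubric-guided, trait-focused prompting described above can be sketched as a small prompt builder. This is an illustrative assumption, not the paper's actual implementation: the function name, the `TRAIT_RUBRICS` table, and the rubric wording are all hypothetical placeholders; only the overall shape (a per-trait rubric, optional scored exemplars for few-shot mode, one essay to score) follows the summary.

```python
# Hypothetical sketch of rubric-guided, trait-specific essay-scoring prompts.
# All identifiers and rubric wording below are illustrative, not from the paper.
TRAIT_RUBRICS = {
    "organization": "1 = no discernible structure ... 5 = clear, logical progression of ideas",
    "style": "1 = flat, repetitive phrasing ... 5 = varied, register-appropriate expression",
}

def build_rubric_prompt(essay, trait, exemplars=()):
    """Compose a trait-specific scoring prompt.

    An empty `exemplars` gives the zero-shot variant; passing scored
    (essay, score) pairs gives the rubric-guided few-shot variant.
    """
    parts = [
        f"You are an expert rater of Arabic essays. Score the trait "
        f"'{trait}' on a 1-5 scale using this rubric:\n{TRAIT_RUBRICS[trait]}"
    ]
    for text, score in exemplars:  # scored exemplars enable few-shot alignment
        parts.append(f"Example essay:\n{text}\nScore: {score}")
    parts.append(f"Essay to score:\n{essay}\nScore:")
    return "\n\n".join(parts)
```

In the hybrid (multi-expert) variant, one such prompt per trait could be sent to the model independently, simulating a panel of trait-specialist raters whose scores are then collected per essay.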

📝 Abstract
This paper presents a novel prompt-engineering framework for trait-specific Automatic Essay Scoring (AES) in Arabic, leveraging large language models (LLMs) under zero-shot and few-shot configurations. Addressing the scarcity of scalable, linguistically informed AES tools for Arabic, we introduce a three-tier prompting strategy (standard, hybrid, and rubric-guided) that guides LLMs in evaluating distinct language proficiency traits such as organization, vocabulary, development, and style. The hybrid approach simulates multi-agent evaluation with trait-specialist raters, while the rubric-guided method incorporates scored exemplars to enhance model alignment. In zero- and few-shot settings, we evaluate eight LLMs on the QAES dataset, the first publicly available Arabic AES resource with trait-level annotations. Experimental results using Quadratic Weighted Kappa (QWK) and Confidence Intervals show that Fanar-1-9B-Instruct achieves the highest trait-level agreement in both zero- and few-shot prompting (QWK = 0.28 and CI = 0.41), with rubric-guided prompting yielding consistent gains across all traits and models. Discourse-level traits such as Development and Style showed the greatest improvements. These findings confirm that structured prompting, not model scale alone, enables effective AES in Arabic. Our study presents the first comprehensive framework for proficiency-oriented Arabic AES and sets the foundation for scalable assessment in low-resource educational contexts.
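The agreement metric reported in the abstract, Quadratic Weighted Kappa, measures how closely a model's integer trait scores track the human reference scores, penalizing disagreements by the square of their distance. A minimal pure-Python sketch (the function name and 1-5 score range are assumptions for illustration; the paper does not specify an implementation):

```python
from collections import Counter

def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """QWK between two equal-length sequences of integer scores.

    1.0 = perfect agreement, 0.0 = chance-level agreement; disagreements
    are weighted by (i - j)^2, so large score gaps are penalized more.
    """
    n = max_score - min_score + 1
    total = len(rater_a)
    # Observed co-occurrence matrix of (score_a, score_b) pairs
    observed = [[0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score][b - min_score] += 1
    # Expected counts come from the product of the marginal histograms
    hist_a = Counter(a - min_score for a in rater_a)
    hist_b = Counter(b - min_score for b in rater_b)
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            expected = hist_a[i] * hist_b[j] / total
            num += weight * observed[i][j]
            den += weight * expected
    return 1.0 - num / den
```

For reference, the paper's best trait-level QWK of 0.28 on QAES indicates fair (well above chance, but far from perfect) model-human agreement on this ordinal scale.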
Problem

Research questions and friction points this paper is trying to address.

Automatic Essay Scoring
Arabic
Trait-Centric Evaluation
Prompt Engineering
Low-Resource Languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

structured prompting
trait-specific AES
rubric-guided prompting
Arabic language assessment
few-shot LLM evaluation