Evaluating Austrian A-Level German Essays with Large Language Models for Automated Essay Scoring

📅 2026-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study explores automated scoring of Austrian upper-secondary A-level German essays as a way to reduce teacher grading workload and mitigate subjective bias. Leveraging standardized rubrics, it presents the first systematic evaluation of multiple open-source large language models, including DeepSeek-R1 32B, Qwen3 30B, Mixtral 8x7B, and Llama3.3 70B, on authentic examination data, employing diverse prompting strategies and contextual configurations for fine-grained scoring analysis. Experimental results indicate that the best-performing model achieves up to 40.6% agreement with human raters on individual scoring dimensions and 32.8% exact match on total scores. Although these performance levels remain insufficient for direct deployment in operational settings, the work establishes a benchmark and advances methodological understanding for automated essay scoring in non-English educational contexts.
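To make the rubric-based setup concrete, here is a minimal sketch of how an essay might be scored against named sub-dimensions by prompting a model and parsing a structured reply. The rubric dimensions, score ranges, prompt wording, and function names below are hypothetical placeholders for illustration, not the official Austrian rubric or the paper's actual prompts.

```python
import json

# Hypothetical rubric: sub-dimension name -> (min, max) integer score.
# These dimensions are illustrative, not the official Austrian A-level rubric.
RUBRIC = {
    "content": (0, 3),
    "structure": (0, 3),
    "expression": (0, 3),
    "language_correctness": (0, 3),
}

def build_grading_prompt(essay: str, text_type: str) -> str:
    """Assemble a rubric-based grading prompt that asks the model to
    score each sub-dimension separately and answer with JSON only."""
    criteria = "\n".join(
        f"- {name}: integer from {lo} to {hi}" for name, (lo, hi) in RUBRIC.items()
    )
    return (
        f"You are grading an Austrian A-level German essay ({text_type}).\n"
        f"Score the essay on each criterion:\n{criteria}\n"
        'Reply with a JSON object only, e.g. {"content": 2, "structure": 3, ...}.\n\n'
        f"Essay:\n{essay}"
    )

def parse_scores(model_reply: str) -> dict:
    """Extract per-dimension scores from the model's JSON reply and
    check that each score lies within its rubric range."""
    scores = json.loads(model_reply)
    for name, (lo, hi) in RUBRIC.items():
        assert lo <= scores[name] <= hi, f"{name} out of range"
    return scores
```

Keeping each sub-dimension as a separate field, rather than asking for a single grade, is what makes the fine-grained per-dimension agreement analysis reported above possible.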

📝 Abstract
Automated Essay Scoring (AES) has been explored for decades with the goal of supporting teachers by reducing grading workload and mitigating subjective biases. While early systems relied on handcrafted features and statistical models, recent advances in Large Language Models (LLMs) have made it possible to evaluate student writing with unprecedented flexibility. This paper investigates the application of state-of-the-art open-weight LLMs to the grading of Austrian A-level German texts, with a particular focus on rubric-based evaluation. A dataset of 101 anonymised student exams across three text types was processed and evaluated. Four LLMs, DeepSeek-R1 32B, Qwen3 30B, Mixtral 8x7B, and Llama3.3 70B, were evaluated with different contexts and prompting strategies. The LLMs reached at most 40.6% agreement with the human rater on the rubric's sub-dimensions, and only 32.8% of final grades matched those given by a human expert. The results indicate that even though smaller open-weight models are able to apply standardised rubrics to German essay grading, they are not accurate enough to be used in a real-world grading environment.
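The agreement figures quoted above (40.6% on sub-dimensions, 32.8% on final grades) read naturally as exact-match rates. A minimal sketch of that computation follows; the sample scores are invented for illustration and are not data from the paper.

```python
# Exact-match agreement between model and human scores:
# the fraction of essays on which the two scores are identical.

def exact_match_rate(model_scores, human_scores):
    """Fraction of essays where the model score equals the human score."""
    assert len(model_scores) == len(human_scores)
    matches = sum(m == h for m, h in zip(model_scores, human_scores))
    return matches / len(model_scores)

# One rubric sub-dimension and the total score, for three hypothetical essays.
by_dimension = {
    "structure": ([2, 3, 1], [2, 2, 1]),     # (model, human)
    "total":     ([10, 12, 9], [11, 12, 9]),
}

for dim, (model, human) in by_dimension.items():
    print(f"{dim}: {exact_match_rate(model, human):.1%} exact agreement")
```

Exact match is a strict criterion: a model that is consistently off by one point scores zero under it, which partly explains why per-dimension agreement can exceed agreement on the total grade.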
Problem

Research questions and friction points this paper is trying to address.

Automated Essay Scoring
Large Language Models
German Essays
Rubric-based Evaluation
Austrian A-Level
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated Essay Scoring
Large Language Models
Rubric-based Evaluation
German A-Level Essays
Open-weight LLMs