Evaluating Vision-Language and Large Language Models for Automated Student Assessment in Indonesian Classrooms

📅 2025-06-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study presents the first systematic end-to-end evaluation of vision-language models (VLMs) and large language models (LLMs) for automated assessment of handwritten primary-school examinations in under-resourced regions of Indonesia. We construct a VLM→LLM pipeline in which a VLM interprets handwritten images from 646 Grade-4 mathematics and English exams (containing more than 14,000 multi-format responses) and an LLM generates scores and feedback. Results reveal that VLM recognition errors propagate into downstream scoring, significantly degrading accuracy and exposing a critical modality-transformation bottleneck; while the LLM produces basic feedback under noisy input, its capacity for personalization and contextual adaptation remains limited. Our core contributions are: (1) establishing an empirical multimodal assessment benchmark for low-resource educational settings; (2) quantifying the cascading impact of handwriting recognition errors on automated grading performance; and (3) empirically validating both the feasibility and the inherent limitations of VLM–LLM collaboration in authentic classroom environments.
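
As a minimal illustration of the two-stage hand-off described above, the sketch below wires a VLM transcription step into an LLM grading step. The interfaces (`transcribe`, `complete`) and the `assess_answer` helper are hypothetical stand-ins, not the authors' actual APIs or prompts; any client matching these signatures could be plugged in.

```python
# Minimal sketch of the VLM -> LLM pipeline: a VLM turns the handwritten
# answer into text, then an LLM grades that text. All interfaces here are
# illustrative assumptions, not the paper's implementation.

from dataclasses import dataclass
from typing import Protocol


@dataclass
class GradedAnswer:
    transcript: str  # the VLM's reading of the handwritten answer
    score: float     # LLM-assigned score
    feedback: str    # LLM-generated feedback for the student


class VisionLanguageModel(Protocol):
    def transcribe(self, image_bytes: bytes, prompt: str) -> str: ...


class LanguageModel(Protocol):
    def complete(self, prompt: str) -> str: ...


def assess_answer(vlm: VisionLanguageModel, llm: LanguageModel,
                  image_bytes: bytes, question: str, rubric: str) -> GradedAnswer:
    # Stage 1: modality transformation. Any recognition error made here
    # propagates into every downstream grading decision.
    transcript = vlm.transcribe(
        image_bytes,
        prompt=f"Transcribe the handwritten answer to: {question}",
    )
    # Stage 2: score and feedback are generated from the (possibly noisy)
    # transcript, never from the original image.
    reply = llm.complete(
        f"Question: {question}\nRubric: {rubric}\n"
        f"Student answer (transcript): {transcript}\n"
        "Reply as '<score 0-10> | <one sentence of feedback>'."
    )
    score_text, _, feedback = reply.partition("|")
    try:
        score = float(score_text.strip())
    except ValueError:
        score = 0.0  # unparseable LLM reply; a real system would retry
    return GradedAnswer(transcript, score, feedback.strip())
```

Because the LLM only ever sees the transcript, this structure makes the modality-transformation bottleneck explicit: grading quality is upper-bounded by transcription quality.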

📝 Abstract
Although vision-language models and large language models (VLMs and LLMs) offer promising opportunities for AI-driven educational assessment, their effectiveness in real-world classroom settings, particularly in underrepresented educational contexts, remains underexplored. In this study, we evaluated the performance of a state-of-the-art VLM and several LLMs on 646 handwritten exam sheets from Grade 4 students in six Indonesian schools, covering two subjects: Mathematics and English. These sheets contain more than 14,000 student answers spanning multiple-choice, short-answer, and essay questions. Assessment tasks include grading these responses and generating personalized feedback. Our findings show that the VLM often struggles to accurately recognize student handwriting, leading to error propagation in downstream LLM grading. Nevertheless, LLM-generated feedback retains some utility even when derived from imperfect input, although limitations in personalization and contextual relevance persist.
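
One way to quantify the error propagation the abstract reports (a hypothetical protocol, not necessarily the paper's exact metrics) is to measure transcript quality with character error rate (CER) against a human transcription, then check how LLM-human score agreement falls as CER rises. Both helpers below are illustrative.

```python
# Hypothetical error-propagation analysis: relate transcription noise (CER)
# to grading agreement. Metric choices are illustrative assumptions.

def character_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein edit distance, normalized by reference length."""
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / max(m, 1)


def score_agreement(human_scores: list[float], llm_scores: list[float]) -> float:
    """Fraction of answers where the LLM score matches the human score."""
    matches = sum(h == s for h, s in zip(human_scores, llm_scores))
    return matches / max(len(human_scores), 1)
```

Bucketing answers by CER and comparing agreement within each bucket would make the cascading effect visible: buckets with higher CER should show lower score agreement.
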
Problem

Research questions and friction points this paper is trying to address.

Assess VLM and LLM performance in Indonesian classroom grading
Evaluate handwritten answer recognition accuracy in student exams
Analyze LLM-generated feedback quality and personalization limits
Innovation

Methods, ideas, or system contributions that make the work stand out.

First end-to-end evaluation of a VLM→LLM pipeline for student assessment
Tested on 646 handwritten Indonesian exam sheets spanning Mathematics and English
Showed LLM feedback retains some utility despite handwriting recognition errors