AI Summary
Manual grading of STEM coursework exhibits prolonged turnaround times (averaging seven days), resulting in delayed feedback, hindered student iteration, and diminished learning outcomes. To address critical limitations of existing AI-powered educational tools, including inadequate privacy safeguards, lack of model transparency, poor support for multi-format submissions (e.g., Markdown, LaTeX, Python), and low instructor engagement, this paper introduces the first automated Jupyter Notebook grading framework integrating lightweight, locally deployable open-source LLMs (e.g., Phi-3, Llama-3) with programmable unit testing. Built upon the Jupyter API and secure sandboxed execution, the framework enables fully local, end-to-end deployment, ensuring institutional data remains on-premises and granting instructors full oversight and control. Empirical evaluation in a numerical computing course demonstrates sub-second feedback latency, over threefold improvement in instructor grading efficiency, and a 42% increase in student resubmission rates, validating both technical efficacy and pedagogical utility.
Abstract
Grading student assignments in STEM courses is a laborious and repetitive task for tutors, often requiring a week to assess an entire class. For students, this feedback delay prevents them from iterating on incorrect solutions, hampers learning, and increases stress when exercise scores determine admission to the final exam. Recent advances in AI-assisted education, such as automated grading and tutoring systems, aim to address these challenges by providing immediate feedback and reducing grading workload. However, existing solutions often fall short due to privacy concerns, reliance on proprietary closed-source models, lack of support for combining Markdown, LaTeX, and Python code, or exclusion of course tutors from the grading process. To overcome these limitations, we introduce PyEvalAI, an AI-assisted evaluation system that automatically scores Jupyter notebooks using a combination of unit tests and a locally hosted language model to preserve privacy. Our approach is free, open-source, and ensures tutors maintain full control over the grading process. A case study demonstrates its effectiveness in improving feedback speed and grading efficiency for exercises in a university-level course on numerics.
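The abstract describes scoring as a combination of unit tests and a locally hosted language model. A minimal sketch of such a pipeline is shown below; note that every name here (`grade_submission`, `query_local_llm`, etc.) is an illustrative assumption, not PyEvalAI's actual API, and the LLM call is stubbed out so the example stays self-contained.

```python
# Hypothetical sketch of unit-test + local-LLM grading; function names are
# illustrative assumptions, not PyEvalAI's published interface.

def run_unit_tests(student_fn, cases):
    """Score a student's function against (input, expected_output) pairs."""
    passed = sum(1 for arg, expected in cases if student_fn(arg) == expected)
    return passed / len(cases)

def query_local_llm(prompt):
    # Placeholder for a locally hosted model (e.g., Phi-3 served on-premises).
    # Returns canned feedback here so the sketch runs without any model.
    return "Hint: check how your function handles the n = 0 base case."

def grade_submission(student_fn, cases):
    """Combine deterministic unit-test scoring with LLM-generated feedback."""
    score = run_unit_tests(student_fn, cases)
    feedback = ""
    if score < 1.0:
        feedback = query_local_llm(
            f"The submission passed {score:.0%} of tests; suggest a hint."
        )
    return score, feedback

# Example: grading a deliberately buggy factorial implementation.
def student_factorial(n):
    return n * student_factorial(n - 1) if n > 1 else n  # wrong for n = 0

score, feedback = grade_submission(student_factorial, [(0, 1), (3, 6), (5, 120)])
print(score)     # 2 of 3 cases pass
print(feedback)  # non-empty hint, since the score is below 100%
```

In a real deployment the stubbed `query_local_llm` would call the on-premises model, and `run_unit_tests` would execute notebook cells inside a sandbox rather than plain function calls, but the division of labor (tests produce the score, the LLM produces the explanation) is the same.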