Autograding Mathematical Induction Proofs with Natural Language Processing

πŸ“… 2024-06-11
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Timely, fine-grained automated feedback is critically needed in mathematical proof instruction, yet existing systems struggle to assess free-form inductive proofs accurately. Method: The authors present an AI teaching-assistant system for automated scoring of mathematical induction proofs. It is built by fine-tuning four robust open-source large language models (e.g., LLaMA, Phi) on proof data collected from four proof-by-induction problems, and it is the first systematic evaluation and optimization of multiple LLMs for this task. Contribution/Results: The best-performing model achieves >75% scoring accuracy relative to human graders and grades more accurately than most of the human graders recruited for comparison. A user study shows that the autograder's fine-grained feedback helps students significantly improve their proofs, although students still trust the AI autograder less than human graders. This work establishes a reproducible, interpretable, and educationally practical paradigm for automated proof assessment.

πŸ“ Abstract
In mathematical proof education, there remains a need for interventions that help students learn to write mathematical proofs. Research has shown that timely feedback can be very helpful to students learning new skills. While natural language processing models have long struggled on tasks involving mathematical text, recent developments have made it possible to give students instant feedback on their mathematical proofs. In this paper, we present a set of training methods and models capable of autograding freeform mathematical proofs by leveraging existing large language models and other machine learning techniques. The models are trained on proof data collected from four different proof-by-induction problems. We compare four robust large language models, all of which achieve satisfactory performance to varying degrees. Additionally, we recruit human graders to grade the same proofs used for training, and find that the best grading model is also more accurate than most human graders. Using these grading models, we create and deploy an autograder for proof-by-induction problems and perform a user study with students. Results from the study show that students are able to make significant improvements to their proofs using the feedback from the autograder, but they still do not trust AI autograders as much as they trust human graders. Future work can improve the autograder's feedback and explore ways to help students trust AI autograders.
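The abstract describes scoring freeform induction proofs with fine-tuned LLMs and returning fine-grained feedback. The paper does not publish its rubric or API, so the sketch below is purely illustrative: it assumes a generic rubric with four hypothetical items (base case, inductive hypothesis, inductive step, conclusion), builds a grading prompt that any text-interface model could consume, and parses a JSON reply into per-item scores. No model is called; a canned reply stands in for the model's output.

```python
import json

# Illustrative rubric for a proof-by-induction problem; the paper's
# actual rubric items are not given in the abstract.
RUBRIC = ["base_case", "inductive_hypothesis", "inductive_step", "conclusion"]

def build_grading_prompt(problem: str, proof: str) -> str:
    """Assemble a prompt asking a model to score each rubric item
    as 0 or 1 and reply with a JSON object."""
    items = ", ".join(RUBRIC)
    return (
        f"Problem: {problem}\n"
        f"Student proof: {proof}\n"
        f"Score each rubric item ({items}) as 0 or 1 and reply with a "
        f"JSON object mapping item names to scores."
    )

def parse_scores(model_reply: str) -> dict:
    """Parse the model's JSON reply; missing or unparseable items
    default to 0 so a malformed reply never crashes the autograder."""
    try:
        raw = json.loads(model_reply)
    except json.JSONDecodeError:
        raw = {}
    return {item: int(raw.get(item, 0)) for item in RUBRIC}

# Demonstration with a canned reply in place of a real model call.
reply = '{"base_case": 1, "inductive_hypothesis": 1, "inductive_step": 0, "conclusion": 1}'
scores = parse_scores(reply)
print(scores, sum(scores.values()))
```

Per-item scores like these are what make the feedback "fine-grained": the student sees which step of the induction failed, not just an overall grade.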
Problem

Research questions and friction points this paper is trying to address.

Autograding freeform mathematical induction proofs
Leveraging large language models for instant feedback
Improving student proof skills with AI autograders
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages large language models
Trains on induction proof data
Autogrades with NLP techniques
πŸ”Ž Similar Papers
No similar papers found.
Chenyan Zhao
Department of Computer Science, University of Illinois Urbana-Champaign, 201 North Goodwin Avenue, Urbana, IL 61801, USA
Mariana Silva
Department of Computer Science, University of Illinois Urbana-Champaign, 201 North Goodwin Avenue, Urbana, IL 61801, USA
Seth Poulsen
Department of Computer Science, Utah State University, 4205 Old Main Hill, Logan, UT 84322, USA