🤖 AI Summary
Prior evaluations of large language models (LLMs) on mathematical reasoning lack rigorous, educationally grounded empirical validation in authentic undergraduate settings. Method: We conducted the first blind empirical study in a real undergraduate algorithms course, assessing GPT-4o and o1-preview on proof-based free-response tasks. Submissions were anonymized, and teaching assistants performed blind grading with fine-grained error annotation, followed by a statistical comparison against student performance. Contribution/Results: o1-preview passed all assessments and significantly outperformed the student median, while GPT-4o consistently scored below the passing threshold. Both models exhibited systematic failures, including unsubstantiated claims and misleading logical steps, that undermine their pedagogical reliability. This work establishes the first strict, blind-evaluation paradigm for LLM proof capabilities within university instruction, delivering a reproducible assessment framework and critical insights into failure modes for AI-augmented mathematics education.
📝 Abstract
As large language models (LLMs) advance, their role in higher education, particularly in free-response problem-solving, requires careful examination. This study assesses the performance of GPT-4o and o1-preview under realistic educational conditions in an undergraduate algorithms course. Anonymized GPT-generated solutions to take-home exams were graded by teaching assistants who were unaware of their origin. Our analysis examines both coarse-grained performance (scores) and fine-grained reasoning quality (error patterns). Results show that GPT-4o consistently struggles, failing to reach the passing threshold, while o1-preview performs significantly better, surpassing the passing score and even exceeding the student median in certain exercises. However, both models exhibit issues with unjustified claims and misleading arguments. These findings highlight the need for robust assessment strategies and AI-aware grading policies in education.
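As a rough illustration of the blind comparison against student performance described above, the sketch below ranks each model's blind-graded score within the student score distribution for each exercise. The scores, exercise labels, passing threshold, and percentile-based comparison are hypothetical assumptions for illustration only, not the study's actual data or statistical methodology.

```python
# Minimal sketch of a blind score comparison (illustrative assumptions only;
# not the paper's actual data, threshold, or statistical tests).
from statistics import median
from scipy.stats import percentileofscore

PASSING = 50  # assumed passing threshold (out of 100)

# Blind-graded scores per take-home exercise: each model's anonymized
# submission was graded alongside the student submissions.
student_scores = {"ex1": [62, 71, 55, 80, 68, 74], "ex2": [58, 66, 90, 49, 77, 61]}
model_scores = {"gpt-4o": {"ex1": 41, "ex2": 38}, "o1-preview": {"ex1": 85, "ex2": 79}}

for model, per_exercise in model_scores.items():
    for ex, score in per_exercise.items():
        students = student_scores[ex]
        pct = percentileofscore(students, score)  # model's rank within the class
        print(f"{model} {ex}: score={score} (pass={score >= PASSING}), "
              f"student median={median(students)}, percentile={pct:.0f}")
```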