Assessing GPT Performance in a Proof-Based University-Level Course Under Blind Grading

📅 2025-05-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Prior evaluations of large language models (LLMs) on mathematical reasoning lack rigorous, educationally grounded validation in authentic undergraduate settings. Method: We conducted the first blind empirical study in a real undergraduate algorithms course, assessing GPT-4o and o1-preview on proof-based free-response tasks. Submissions were anonymized; teaching assistants, unaware of their origin, performed blind grading with fine-grained error annotation, followed by statistical comparison against student performance. Contribution/Results: o1-preview passed all assessments and significantly outperformed the student median; GPT-4o consistently scored below the passing threshold. Both models exhibited systematic failures, including unsubstantiated claims and misleading logical steps, undermining pedagogical reliability. This work establishes a strict blind-evaluation paradigm for LLM proof capabilities within university instruction, delivering a reproducible assessment framework and critical failure insights for AI-augmented mathematics education.

📝 Abstract
As large language models (LLMs) advance, their role in higher education, particularly in free-response problem-solving, requires careful examination. This study assesses the performance of GPT-4o and o1-preview under realistic educational conditions in an undergraduate algorithms course. Anonymous GPT-generated solutions to take-home exams were graded by teaching assistants unaware of their origin. Our analysis examines both coarse-grained performance (scores) and fine-grained reasoning quality (error patterns). Results show that GPT-4o consistently struggles, failing to reach the passing threshold, while o1-preview performs significantly better, surpassing the passing score and even exceeding the student median in certain exercises. However, both models exhibit issues with unjustified claims and misleading arguments. These findings highlight the need for robust assessment strategies and AI-aware grading policies in education.
Problem

Research questions and friction points this paper addresses.

Evaluating GPT-4o and o1-preview performance in university algorithms course
Assessing AI-generated solutions under blind grading by teaching assistants
Identifying model weaknesses like unjustified claims and misleading arguments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Blind grading of GPT-generated exam solutions
Comparison of GPT-4o and o1-preview performance
Analysis of error patterns in AI reasoning
Ming Ding
Department of Computer Science, ETH Zurich
Rasmus Kyng
ETH Zurich
Federico Soldà
Department of Computer Science, ETH Zurich
Weixuan Yuan
Department of Computer Science, ETH Zurich