🤖 AI Summary
Prior evaluations of large language models (LLMs) on mathematical reasoning lack rigorous, educationally grounded empirical validation in authentic undergraduate settings. Method: We conducted the first blind empirical study in a real undergraduate algorithms course, assessing GPT-4o and o1-preview on proof-based free-response tasks. Submissions were anonymized, and teaching assistants performed blind grading with fine-grained error annotation, followed by a statistical comparison against student performance. Contribution/Results: o1-preview passed all assessments and significantly outperformed the student median, while GPT-4o consistently scored below the passing threshold. Both models exhibited systematic failures, including unsubstantiated claims and misleading logical steps, that undermine their pedagogical reliability. This work establishes the first strict, blind-evaluation paradigm for LLM proof capabilities within university instruction, delivering a reproducible assessment framework and critical insights into failure modes for AI-augmented mathematics education.
📝 Abstract
As large language models (LLMs) advance, their role in higher education, particularly in free-response problem-solving, requires careful examination. This study assesses the performance of GPT-4o and o1-preview under realistic educational conditions in an undergraduate algorithms course. Anonymized GPT-generated solutions to take-home exams were graded by teaching assistants who were unaware of their origin. Our analysis examines both coarse-grained performance (scores) and fine-grained reasoning quality (error patterns). Results show that GPT-4o consistently struggles, failing to reach the passing threshold, while o1-preview performs significantly better, surpassing the passing score and even exceeding the student median in certain exercises. However, both models exhibit issues with unjustified claims and misleading arguments. These findings highlight the need for robust assessment strategies and AI-aware grading policies in education.
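As a rough illustration of the blind comparison against student performance described above, the sketch below ranks each model's blind-graded score within the student score distribution for each exercise. The scores, exercise labels, passing threshold, and percentile-based comparison are hypothetical assumptions for illustration only, not the study's actual data or statistical methodology.

```python
# Minimal sketch of a blind score comparison (illustrative assumptions only;
# not the paper's actual data, threshold, or statistical tests).
from statistics import median
from scipy.stats import percentileofscore

PASSING = 50  # assumed passing threshold (out of 100)

# Blind-graded scores per take-home exercise: each model's anonymized
# submission was graded alongside the student submissions.
student_scores = {"ex1": [62, 71, 55, 80, 68, 74], "ex2": [58, 66, 90, 49, 77, 61]}
model_scores = {"gpt-4o": {"ex1": 41, "ex2": 38}, "o1-preview": {"ex1": 85, "ex2": 79}}

for model, per_exercise in model_scores.items():
    for ex, score in per_exercise.items():
        students = student_scores[ex]
        pct = percentileofscore(students, score)  # model's rank within the class
        print(f"{model} {ex}: score={score} (pass={score >= PASSING}), "
              f"student median={median(students)}, percentile={pct:.0f}")
```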