QEDBENCH: Quantifying the Alignment Gap in Automated Evaluation of University-Level Mathematical Proofs

📅 2026-02-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses systematic alignment biases in existing "LLM-as-a-Judge" approaches to evaluating advanced undergraduate and early graduate-level mathematical proofs, where model assessments often diverge from expert human judgments. The work proposes the first dual-dimension alignment evaluation framework tailored to university-level mathematical proofs, introducing QEDBench, a benchmark that pairs course-specific grading rubrics with expert-derived general knowledge criteria. Leveraging over 1,000 hours of human annotation and a dual-evaluation matrix of seven LLM judges and five solvers, the study systematically analyzes the scoring behavior of prominent large language models, including Claude Opus 4.5 and Qwen 2.5 Max. Results reveal a pervasive positive scoring bias among AI judges (mean score inflation of up to +0.36) and identify Gemini 3.0 Pro as the top performer in discrete mathematics (0.91 average human evaluation score), while GPT-5 Pro and Claude Sonnet 4.5 degrade sharply on graph theory tasks (down to 0.74 and 0.50, respectively). The benchmark and associated data are publicly released.

📝 Abstract
As Large Language Models (LLMs) saturate elementary benchmarks, the research frontier has shifted from generation to the reliability of automated evaluation. We demonstrate that standard "LLM-as-a-Judge" protocols suffer from a systematic Alignment Gap when applied to upper-undergraduate to early-graduate-level mathematics. To quantify this, we introduce QEDBench, the first large-scale dual-rubric alignment benchmark that systematically measures alignment with human experts on university-level math proofs by contrasting course-specific rubrics against expert common-knowledge criteria. By deploying a dual-evaluation matrix (7 judges × 5 solvers) against 1,000+ hours of human evaluation, we reveal that frontier evaluators such as Claude Opus 4.5, DeepSeek-V3, Qwen 2.5 Max, and Llama 4 Maverick exhibit significant positive bias (up to +0.18, +0.20, +0.30, and +0.36 mean score inflation, respectively). Furthermore, we uncover a critical reasoning gap in the discrete domain: while Gemini 3.0 Pro achieves state-of-the-art performance (0.91 average human evaluation score), other reasoning models such as GPT-5 Pro and Claude Sonnet 4.5 degrade significantly, with average human evaluation scores dropping to 0.72 and 0.63 in Discrete Math, and to 0.74 and 0.50 in Graph Theory. In addition to these research results, we release QEDBench as a public benchmark for evaluating and improving AI judges, available at https://github.com/qqliu/Yale-QEDBench.
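The bias figures quoted in the abstract (e.g. +0.36 for Llama 4 Maverick) are mean score inflations: the average signed difference between a judge's score and the human expert's score for the same proof. A minimal sketch of that computation, assuming paired per-proof scores on a 0–1 scale (the function name and all data below are illustrative, not from the paper's released code):

```python
# Mean score inflation: average signed gap between an LLM judge's
# scores and human expert scores on the same set of proofs.
# All names and numbers here are hypothetical, not from QEDBench.

def mean_score_inflation(judge_scores, human_scores):
    """A positive result means the judge systematically over-scores."""
    assert len(judge_scores) == len(human_scores)
    diffs = [j - h for j, h in zip(judge_scores, human_scores)]
    return sum(diffs) / len(diffs)

# Hypothetical scores on a 0-1 scale for five proofs:
judge = [0.90, 0.80, 0.75, 1.00, 0.85]
human = [0.70, 0.60, 0.65, 0.80, 0.75]
print(round(mean_score_inflation(judge, human), 2))  # prints 0.16
```

Because the gap is signed, over- and under-scoring can cancel; a mean inflation near zero therefore indicates low systematic bias, not necessarily high per-proof agreement.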
Problem

Research questions and friction points this paper is trying to address.

Alignment Gap
Automated Evaluation
Mathematical Proofs
LLM-as-a-Judge
Discrete Mathematics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Alignment Gap
QEDBench
LLM-as-a-Judge
mathematical proof evaluation
dual-rubric benchmark
Santiago Gonzalez
Computer Science PhD from UT Austin
Machine Learning, Evolutionary Computation, Wireless Sensor Networks
Alireza Amiri Bavandpour
Department of Computer Science, Yale University, New Haven, CT, USA
Peter Ye
Department of Computer Science, Yale University, New Haven, CT, USA
Edward Zhang
Student in ECE, Carnegie Mellon University
Machine Learning
Ruslans Aleksejevs
UC Berkeley
Todor Antić
Charles University
Polina Baron
The University of Chicago
Sujeet Bhalerao
University of Illinois at Urbana-Champaign
quantum information theory
Shubhrajit Bhattacharya
The University of Chicago
Zachary Burton
MIT
John Byrne
University of Delaware
Hyungjun Choi
Princeton University
Nujhat Ahmed Disha
MIT
Koppany István Encz
USI, IDSIA
Yuchen Fang
University of California, Berkeley
constrained stochastic optimization, high-dimensional statistics
Robert Joseph George
California Institute of Technology
Machine Learning, AI4Science, AI4Math
Ebrahim Ghorbani
K. N. Toosi University of Technology and Hamburg University of Technology
Graph Theory, Algorithms, Combinatorics
Alan Goldfarb
UC Berkeley
Jing Guo
Universität Regensburg
Meghal Gupta
U.C. Berkeley
Theoretical Computer Science
Stefano Huber
USI, IDSIA
Annika Kanckos
University of Helsinki
Minjung Kang
Illinois Institute of Technology
Hyun Jong Kim
University of Western Ontario
Dino Lorenzini
University of Georgia