🤖 AI Summary
This study presents the first systematic evaluation of frontier large language models (LLMs) on formal mathematical reasoning in doctoral-level theoretical computer science, specifically randomized algorithms. Grounded in Motwani and Raghavan's canonical textbook, we construct the first domain-specific LLM benchmark requiring rigorous, machine-verifiable LaTeX-formatted proofs. Our methodology introduces a multidimensional qualitative analysis framework that measures hallucination rate, logical coherence, and conciseness, and combines it with automated prompt engineering, formal proof generation, LaTeX parsing, and human verification protocols. Experimental results show that Gemini-3-Pro and Claude-Sonnet-4.5-Thinking achieve approximately 66% proof correctness, while the other evaluated models attain approximately 40%. All code, datasets, and generated proofs are publicly released, establishing a new benchmark and methodological foundation for assessing advanced mathematical reasoning capabilities in LLMs.
📝 Abstract
The rapid advancement of large language models (LLMs) has led to significant breakthroughs in automated mathematical reasoning and scientific discovery. Georgiev, Gómez-Serrano, Tao, and Wagner [GGSTW+25] demonstrate that AI systems can explore new constructions and improve existing bounds, illustrating the growing potential of LLMs to accelerate mathematical discovery. Similarly, Bubeck et al. [BCE+25] show that GPT-5 can meaningfully contribute to scientific workflows, from proposing hypotheses to generating proofs and analyses. Despite these advances, a rigorous evaluation of these models on canonical, graduate-level mathematical theory remains necessary to understand their baseline reasoning capabilities. In this paper, we benchmark four frontier models (GPT-5-Thinking, Gemini-3-Pro, Claude-Sonnet-4.5-Thinking, and Grok-4) against the classic curriculum of Randomized Algorithms by Motwani and Raghavan [MR95].
We tasked each model with generating formal LaTeX proofs for a series of lemmas and exercises spanning the textbook. We find that the top-tier models (Gemini and Claude) achieve a high accuracy rate (approximately 66%), demonstrating a robust grasp of the probabilistic method and formal logic, while the other models lag significantly in consistency (approximately 40%). We provide a qualitative analysis of the generated proofs, highlighting differences in conciseness, hallucination rates, and logical structure. Our results suggest that although frontier models have reached a threshold of proficiency suitable for graduate-level pedagogical assistance and formalization, their reliability for rigorous mathematical derivation varies significantly. The code and the full set of LLM-generated responses are open-sourced and publicly available at https://github.com/magiclinux/math_benchmark_probability.