Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad

📅 2025-03-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing mathematical evaluation benchmarks assess only final answers, neglecting rigorous reasoning and formal proof generation capabilities. Method: This work presents the first systematic evaluation of large language models (LLMs) on the complete 2025 USAMO—six proof-based problems—emphasizing constructive proof synthesis over answer matching. We introduce a fine-grained, competition-oriented scoring framework integrating expert human annotation, stepwise reasoning trajectory analysis, and failure-mode attribution, applied to state-of-the-art reasoning models including o3-mini. Contribution/Results: All evaluated models achieve an average score below 5%, exposing fundamental deficiencies—including logical gaps, proof hallucinations, and structural incoherence—demonstrating that current LLMs remain incapable of performing high-fidelity, formally rigorous reasoning required for advanced mathematical problem solving.

📝 Abstract
Recent math benchmarks for large language models (LLMs) such as MathArena indicate that state-of-the-art reasoning models achieve impressive performance on mathematical competitions like AIME, with the leading model, o3-mini, achieving scores comparable to top human competitors. However, these benchmarks evaluate models solely based on final numerical answers, neglecting rigorous reasoning and proof generation which are essential for real-world mathematical tasks. To address this, we introduce the first comprehensive evaluation of full-solution reasoning for challenging mathematical problems. Using expert human annotators, we evaluated several state-of-the-art reasoning models on the six problems from the 2025 USAMO within hours of their release. Our results reveal that all tested models struggled significantly, achieving less than 5% on average. Through detailed analysis of reasoning traces, we identify the most common failure modes and find several unwanted artifacts arising from the optimization strategies employed during model training. Overall, our results suggest that current LLMs are inadequate for rigorous mathematical reasoning tasks, highlighting the need for substantial improvements in reasoning and proof generation capabilities.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs on rigorous math proof generation
Assessing failure modes in reasoning for USAMO problems
Identifying training artifacts affecting mathematical reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluating full-solution reasoning for math problems
Using expert human annotators for model assessment
Identifying failure modes in model reasoning traces
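
The paper's competition-oriented scoring can be illustrated with a minimal sketch. It assumes the standard USAMO setup (six problems, each graded 0-7, 42 points maximum) and averages multiple expert graders per problem before reporting a percentage; the function and variable names are illustrative, not from the paper.

```python
# Sketch of competition-style scoring: six USAMO problems, each graded
# 0-7; per-problem scores from multiple graders are averaged, then the
# total is reported as a percentage of the 42-point maximum.
# Names here are hypothetical, not the paper's actual implementation.

MAX_POINTS_PER_PROBLEM = 7
NUM_PROBLEMS = 6

def percentage_score(grader_scores: list[list[float]]) -> float:
    """grader_scores[i] holds each grader's 0-7 score for problem i."""
    assert len(grader_scores) == NUM_PROBLEMS
    per_problem = [sum(s) / len(s) for s in grader_scores]
    return 100.0 * sum(per_problem) / (MAX_POINTS_PER_PROBLEM * NUM_PROBLEMS)

# Example: a model earning only scattered partial credit, consistent
# with the sub-5% averages reported in the paper.
scores = [[1, 1], [0, 0], [0, 1], [0, 0], [0, 0], [0, 0]]
print(round(percentage_score(scores), 1))  # → 3.6
```

Averaging graders before summing mirrors how small rubric disagreements are reconciled without letting a single grader dominate a problem's score.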
Ivo Petrov
PhD student, INSAIT, Sofia University
Gradient Leakage, LLM Reasoning
Jasper Dekoninck
PhD Student, ETH Zurich
large language models, quantum computing, evaluation
Lyuben Baltadzhiev
INSAIT, Sofia University "St. Kliment Ohridski"
Maria Drencheva
INSAIT, Sofia University "St. Kliment Ohridski"
Kristian Minchev
PhD student, INSAIT, Sofia University
machine learning
Mislav Balunović
ETH Zurich, INSAIT, Sofia University "St. Kliment Ohridski"
Nikola Jovanović
ETH Zurich
Martin T. Vechev
ETH Zurich, INSAIT, Sofia University "St. Kliment Ohridski"