Towards Robust Mathematical Reasoning

📅 2025-11-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing mathematical reasoning benchmarks primarily assess simple answer matching, failing to adequately evaluate advanced reasoning and formal proof writing. To address this, the authors propose IMO-Bench, a rigorous evaluation suite aligned with International Mathematical Olympiad (IMO) standards, comprising two complementary benchmarks: IMO-AnswerBench (assessing short-answer correctness on 400 Olympiad problems) and IMO-ProofBench (evaluating proof-writing quality against detailed grading guidelines). They also release IMO-GradingBench, a dataset of 1,000 human-graded proofs, and build Gemini-based autograders whose scores agree closely with human graders (Pearson *r* > 0.92). Gemini Deep Think attains 80.0% on IMO-AnswerBench and 65.7% on the advanced IMO-ProofBench, surpassing the best non-Gemini models by large margins of 6.9% and 42.4% respectively, and these benchmarks underpinned the model's gold-medal-level performance at IMO 2025. IMO-Bench establishes a robust benchmark suite for evaluating sophisticated mathematical reasoning.

📝 Abstract
Finding the right north-star metrics is critical for advancing the mathematical reasoning capabilities of foundation models, especially given that existing evaluations are either too easy or only focus on getting correct short answers. To address these issues, we present IMO-Bench, a suite of advanced reasoning benchmarks, vetted by a panel of top specialists, that specifically targets the level of the International Mathematical Olympiad (IMO), the most prestigious venue for young mathematicians. IMO-AnswerBench first tests models on 400 diverse Olympiad problems with verifiable short answers. IMO-ProofBench is the next-level evaluation for proof-writing capabilities, which includes both basic and advanced IMO-level problems as well as detailed grading guidelines to facilitate automatic grading. These benchmarks played a crucial role in our historic achievement of gold-level performance at IMO 2025 with Gemini Deep Think (Luong and Lockhart, 2025). Our model achieved 80.0% on IMO-AnswerBench and 65.7% on the advanced IMO-ProofBench, surpassing the best non-Gemini models by large margins of 6.9% and 42.4% respectively. We also showed that autograders built with Gemini reasoning correlate well with human evaluations, and we construct IMO-GradingBench, with 1,000 human gradings on proofs, to enable further progress in automatic evaluation of long-form answers. We hope that IMO-Bench will help the community advance robust mathematical reasoning, and we release it at https://imobench.github.io/.
Problem

Research questions and friction points this paper is trying to address.

Advancing mathematical reasoning capabilities of foundation models
Creating robust benchmarks for International Mathematical Olympiad level problems
Developing automatic grading systems for proof-writing evaluations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Created IMO-Bench for advanced mathematical reasoning evaluation
Developed autograders using Gemini for proof assessment
Achieved gold-level IMO performance with Gemini Deep Think
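The paper validates its Gemini-based autograders by checking how well their proof scores agree with human grades. A minimal sketch of that validation step, assuming scores on the standard 0–7 IMO scale and using illustrative (not actual) score data, is just a Pearson correlation between the two score series:

```python
# Hypothetical sketch: checking agreement between autograder and human proof
# grades via Pearson correlation. The score data below is illustrative only,
# not taken from the paper's IMO-GradingBench annotations.
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative grades on the 0-7 IMO scale for eight proofs.
human = [7, 0, 3, 7, 1, 5, 7, 2]
auto  = [7, 1, 3, 6, 1, 5, 7, 2]
print(f"Pearson r = {pearson_r(human, auto):.3f}")  # → Pearson r = 0.987
```

A correlation near 1 (the paper reports *r* > 0.92) indicates the autograder ranks and scores proofs much like human experts, which is what justifies using it for automatic evaluation at scale.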