MathArena: Evaluating LLMs on Uncontaminated Math Competitions

📅 2025-05-29
📈 Citations: 12 (influential: 0)
🤖 AI Summary
Public benchmark datasets (e.g., AIME 2024) suffer from widespread data leakage, confounding the evaluation of LLM mathematical reasoning with memorization effects. Method: MathArena is a contamination-free benchmark for mathematical reasoning built on contest problems evaluated in real time as they are released: 149 previously unseen questions from five major competitions (AIME, SMT, USAMO, etc.), governed by a strict decontamination protocol aligned with official contest release windows. Contribution/Results: A multi-granularity scoring scheme jointly evaluates final-answer correctness and proof rigor, enabling the first standardized, systematic assessment of proof generation. Experiments show that while state-of-the-art models excel on uncontaminated final-answer tasks (e.g., SMT 2025), their scores drop below 25% on USAMO 2025 proof-writing tasks, exposing clear deficiencies in rigorous deductive reasoning. This stark performance gap demonstrates the benchmark's ability to disentangle genuine reasoning capability from dataset memorization.
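The decontamination protocol described above reduces, at its core, to a date comparison: a problem counts as uncontaminated for a given model only if the contest released it after the model was already public. A minimal Python sketch of that check follows; the class names, fields, and dates are illustrative, not taken from the paper.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Problem:
    competition: str   # e.g. "SMT 2025"
    statement: str
    released_on: date  # official contest release date

@dataclass
class Model:
    name: str
    released_on: date  # public release date of the model

def is_uncontaminated(problem: Problem, model: Model) -> bool:
    """A problem is contamination-free for a model only if it was
    released after the model was already public, so it cannot have
    appeared in the model's training data."""
    return problem.released_on > model.released_on

# Illustrative dates only: evaluate a pre-2025 model on a problem
# released in spring 2025.
model = Model("example-model", date(2024, 12, 1))
problem = Problem("SMT 2025", "...", date(2025, 4, 15))
assert is_uncontaminated(problem, model)
```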

📝 Abstract
The rapid advancement of reasoning capabilities in large language models (LLMs) has led to notable improvements on mathematical benchmarks. However, many of the most commonly used evaluation datasets (e.g., AIME 2024) are widely available online, making it difficult to disentangle genuine reasoning from potential memorization. Furthermore, these benchmarks do not evaluate proof-writing capabilities, which are crucial for many mathematical tasks. To address this, we introduce MathArena, a new benchmark based on the following key insight: recurring math competitions provide a stream of high-quality, challenging problems that can be used for real-time evaluation of LLMs. By evaluating models as soon as new problems are released, we effectively eliminate the risk of contamination. Using this framework, we find strong signs of contamination in AIME 2024. Nonetheless, evaluations on harder competitions, such as SMT 2025 -- published well after model release dates -- demonstrate impressive reasoning capabilities in top-performing models. MathArena is also the first benchmark for proof-writing capabilities. On USAMO 2025, even top models score below 25%, far behind their performance on final-answer tasks. So far, we have evaluated 30 models across five competitions, totaling 149 problems. As an evolving benchmark, MathArena will continue to track the progress of LLMs on newly released competitions, ensuring rigorous and up-to-date evaluation of mathematical reasoning.
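As a rough illustration of the two scoring granularities the abstract describes, the sketch below contrasts binary final-answer credit with rubric-based partial credit for proofs. The 7-point scale follows standard olympiad grading conventions; the exact MathArena rubric is not reproduced here, and the function names are hypothetical.

```python
def final_answer_score(predicted: str, reference: str) -> float:
    """Binary credit for final-answer contests such as AIME:
    the answer is either right or wrong."""
    return 1.0 if predicted.strip() == reference.strip() else 0.0

def proof_score(judge_points: float, max_points: float = 7.0) -> float:
    """Partial credit for proof contests such as USAMO, where judges
    assign rubric points (assumed here to be out of 7, as in standard
    olympiad grading), normalized to [0, 1]."""
    if not 0.0 <= judge_points <= max_points:
        raise ValueError("judge points out of range")
    return judge_points / max_points

# A model averaging 1.5/7 per USAMO problem scores ~21%, in line
# with the sub-25% proof-writing results the abstract reports.
print(final_answer_score("204", "204"))  # 1.0
print(round(proof_score(1.5), 3))        # 0.214
```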
Problem

Research questions and friction points this paper is trying to address.

Assessing LLMs' genuine math reasoning versus memorization
Evaluating proof-writing skills in mathematical tasks
Providing contamination-free benchmarks via real-time competition problems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses recurring math competitions for real-time evaluation
Eliminates contamination risk by evaluating models as soon as problems are released (see the contamination-gap sketch after this list)
First benchmark to assess proof-writing capabilities
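One way to see why real-time evaluation matters is to compare a model's accuracy on a long-public contest against comparable freshly released problems: a large gap is a memorization signal, which is how the paper surfaces contamination in AIME 2024. The sketch below uses hypothetical accuracies and a hypothetical threshold, purely for illustration.

```python
def contamination_gap(acc_public: float, acc_fresh: float) -> float:
    """Accuracy gap (percentage points) between long-public problems
    (e.g. AIME 2024) and freshly released ones (e.g. AIME 2025).
    A large positive gap suggests memorization, not reasoning."""
    return 100.0 * (acc_public - acc_fresh)

# Hypothetical accuracies for illustration only.
gap = contamination_gap(acc_public=0.90, acc_fresh=0.70)
if gap > 10.0:  # hypothetical threshold
    print(f"Possible contamination: {gap:.0f}-point accuracy gap")
```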
👥 Authors
Mislav Balunović (ETH Zurich, INSAIT, Sofia University)
Jasper Dekoninck (PhD Student, ETH Zurich; interests: large language models, quantum computing, evaluation)
Ivo Petrov (PhD student, INSAIT, Sofia University; interests: Gradient Leakage, LLM Reasoning)
Nikola Jovanović (ETH Zurich)
Martin T. Vechev (ETH Zurich, INSAIT, Sofia University)