Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs

📅 2026-05-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Static mathematical benchmarks are narrow in coverage, saturate quickly, and are rarely updated, so they fail to track the progress of large language models in mathematical reasoning. This work extends the original MathArena benchmark from final-answer olympiad problems into a continuously maintained evaluation platform that brings proof-based competition problems, research-level questions from arXiv, and formal proof generation in Lean under a single, clearly defined evaluation protocol. The platform runs, aggregates, and analyzes evaluations across many benchmarks, and new benchmarks are added regularly as model capabilities improve so that it remains challenging. The strongest model, GPT-5.5, reaches 98% on the 2026 USA Mathematical Olympiad and 74% on research-level problems, which underscores the need for continuously maintained evaluation platforms to track rapid progress in mathematical reasoning.

📝 Abstract

Large language models (LLMs) are becoming increasingly capable mathematical collaborators, but static benchmarks are no longer sufficient for evaluating progress: they are often narrow in scope, quickly saturated, and rarely updated. This makes it hard to compare models reliably and track progress over time. Instead, we need evaluation platforms: continuously maintained systems that run, aggregate, and analyze evaluations across many benchmarks to give a comprehensive picture of model performance within a broad domain. In this work, we build on the original MathArena benchmark by substantially broadening its scope from final-answer olympiad problems to a continuously maintained evaluation platform for mathematical reasoning with LLMs. MathArena now covers a much wider range of tasks, including proof-based competitions, research-level arXiv problems, and formal proof generation in Lean. Additionally, we maintain a clear evaluation protocol for all models and regularly design new benchmarks as model capabilities improve to ensure that MathArena remains challenging. Notably, the strongest model, GPT-5.5, now reaches 98% on the 2026 USA Math Olympiad and 74% on research-level questions, showing that frontier models can now comfortably solve extremely challenging mathematical problems. This highlights the importance of continuously maintained evaluation platforms like MathArena to track the rapid progress of LLMs in mathematical reasoning.

Problem

Research questions and friction points this paper is trying to address.

evaluation platforms

mathematical reasoning

large language models

static benchmarks

model performance tracking

Innovation

Methods, ideas, or system contributions that make the work stand out.

evaluation platform

mathematical reasoning

large language models