AI Summary
Existing LLM mathematical reasoning benchmarks suffer from saturation (some achieving >90% accuracy) and severe training-set contamination, undermining their validity for assessing genuine reasoning capabilities.
Method: We introduce Putnam-AXIOM, a rigorous benchmark of 522 university-level competition problems, together with a dynamic variation protocol that programmatically perturbs variables and constants to produce equally difficult, unseen problem variants. To enable robust evaluation beyond final-answer matching, we propose Teacher-Forced Accuracy (TFA), an automated metric that directly scores natural-language reasoning traces and mitigates memorization bias.
Contribution/Results: Experiments reveal that the strongest model, OpenAI's o1-preview, achieves only 41.9% accuracy on the Original set and drops sharply by 19.6 percentage points on the Variation set. The remaining models show the same downward trend, ten of them with statistically significant degradation (non-overlapping 95% confidence intervals). These results demonstrate that Putnam-AXIOM provides a robust, scalable framework for evaluating genuine mathematical reasoning in LLMs.
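The variation protocol above can be sketched in miniature. The template, variable names, and answer function below are illustrative assumptions, not the benchmark's actual problems or implementation; the point is only the mechanism of perturbing constants and recomputing the ground-truth answer so each variant is equally difficult but unseen.

```python
import random

def make_variant(rng: random.Random) -> dict:
    """Perturb the constants of a templated problem (hypothetical example)
    and recompute its answer, yielding a functionally equivalent variant."""
    a = rng.randint(2, 9)
    b = rng.randint(2, 9)
    problem = (
        f"Find the minimum value of f(x) = x^2 - {2 * a}x + {a * a + b} "
        "over the reals."
    )
    # Completing the square gives f(x) = (x - a)^2 + b, so the minimum is b.
    return {"problem": problem, "answer": b}

# A fixed seed makes the variant stream reproducible; changing it
# yields an unlimited supply of fresh, contamination-resilient instances.
rng = random.Random(0)
variants = [make_variant(rng) for _ in range(3)]
```

Because the answer is derived from the same symbolic expression as the problem statement, every sampled variant remains automatically gradable.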
Abstract
Current mathematical reasoning benchmarks for large language models (LLMs) are approaching saturation, with some achieving > 90% accuracy, and are increasingly compromised by training-set contamination. We introduce Putnam-AXIOM, a benchmark of 522 university-level competition problems drawn from the prestigious William Lowell Putnam Mathematical Competition, and Putnam-AXIOM Variation, an unseen companion set of 100 functional variants generated by programmatically perturbing variables and constants. The variation protocol produces an unlimited stream of equally difficult, unseen instances -- yielding a contamination-resilient test bed. On the Original set, OpenAI's o1-preview -- the strongest evaluated model -- scores 41.9%, but its accuracy drops by 19.6 percentage points (a 46.8% relative decrease) on the paired Variations. The remaining eighteen models show the same downward trend, ten of them with non-overlapping 95% confidence intervals. These gaps suggest memorization and highlight the necessity of dynamic benchmarks. We complement "boxed" accuracy with Teacher-Forced Accuracy (TFA), a lightweight metric that directly scores reasoning traces and automates natural language proof evaluations. Putnam-AXIOM therefore provides a rigorous, contamination-resilient evaluation framework for assessing advanced mathematical reasoning of LLMs. Data and evaluation code are publicly available at https://github.com/brando90/putnam-axiom.
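One common formalization of trace scoring under teacher forcing is token-level next-token accuracy against the reference trace; the sketch below assumes that reading and is not the paper's exact implementation. `predicted[i]` stands in for the model's top-1 prediction after conditioning on the ground-truth prefix, which a real evaluation would obtain from model logits.

```python
def teacher_forced_accuracy(predicted: list, reference: list) -> float:
    """Fraction of reference-trace tokens the model predicts correctly
    when conditioned on the ground-truth prefix (teacher forcing).

    Hypothetical sketch: tokens here are plain strings; a real pipeline
    would use the model's tokenizer and argmax over its logits.
    """
    if not reference:
        raise ValueError("reference trace must be non-empty")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)

# Toy example with whitespace-tokenized traces (illustrative only).
ref = "the minimum value is 7".split()
pred = "the minimum value is 9".split()
score = teacher_forced_accuracy(pred, ref)  # 4 of 5 tokens match -> 0.8
```

Unlike boxed-answer matching, this rewards a model for following the reference reasoning step by step, which makes memorized final answers with spurious derivations score poorly.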