🤖 AI Summary
Existing mathematical reasoning benchmarks inadequately assess the true capabilities of large language models (LLMs) on Olympiad-level problems. To address this, we introduce OlymMATH, the first bilingual (Chinese/English) benchmark explicitly designed for Olympiad-level difficulty, comprising 200 human-verified problems spanning algebra, geometry, combinatorics, and number theory. Problems are rigorously stratified into two tiers, AIME-level and high-difficulty, each accompanied by a deterministic, verifiable numerical answer. Key contributions include: (i) the first difficulty-stratified Olympiad-scale benchmark; (ii) a strictly parallel bilingual problem corpus; and (iii) an automated evaluation framework grounded in exact answer matching. Experiments reveal that state-of-the-art models, including DeepSeek-R1 and o3-mini, achieve less than 10% accuracy on the high-difficulty subset, exposing critical limitations in deep deductive reasoning. The benchmark is publicly released to foster rigorous evaluation and advancement of mathematical reasoning in LLMs.
📝 Abstract
In recent years, the rapid development of large reasoning models has resulted in the saturation of existing benchmarks for evaluating mathematical reasoning, highlighting the urgent need for more challenging and rigorous evaluation frameworks. To address this gap, we introduce OlymMATH, a novel Olympiad-level mathematical benchmark designed to rigorously test the complex reasoning capabilities of LLMs. OlymMATH features 200 meticulously curated problems, each manually verified and available in parallel English and Chinese versions. The problems are systematically organized into two distinct difficulty tiers: (1) AIME-level problems (easy) that establish a baseline for mathematical reasoning assessment, and (2) significantly more challenging problems (hard) designed to push the boundaries of current state-of-the-art models. These problems span four core mathematical fields, and each is accompanied by a verifiable numerical solution to enable objective, rule-based evaluation. Empirical results underscore the significant challenge presented by OlymMATH, with state-of-the-art models, including DeepSeek-R1 and OpenAI's o3-mini, demonstrating notably limited accuracy on the hard subset. Furthermore, the benchmark facilitates comprehensive bilingual assessment of mathematical reasoning abilities, a critical dimension that remains largely unaddressed in mainstream mathematical reasoning benchmarks. We release the OlymMATH benchmark at the STILL project: https://github.com/RUCAIBox/Slow_Thinking_with_LLMs.
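Because every problem carries a verifiable numerical answer, scoring reduces to rule-based answer matching rather than LLM-as-judge grading. The sketch below illustrates one plausible form of such an evaluator; the function names (`extract_boxed`, `answers_match`) and the fallback to rational-number comparison are our own assumptions, not the paper's released implementation.

```python
from fractions import Fraction

def extract_boxed(text: str) -> str:
    """Return the content of the last \\boxed{...} span in a model response,
    falling back to the whole stripped response if none is found."""
    marker = r"\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return text.strip()
    depth, i, out = 1, start + len(marker), []
    while i < len(text):
        c = text[i]
        if c == "{":
            depth += 1
        elif c == "}":
            depth -= 1
            if depth == 0:
                break
        out.append(c)
        i += 1
    return "".join(out).strip()

def answers_match(predicted: str, reference: str, tol: float = 1e-9) -> bool:
    """Rule-based check: exact string match first, then numerical
    equivalence via exact rational arithmetic (handles '1/2' vs '0.5')."""
    if predicted.strip() == reference.strip():
        return True
    try:
        return abs(Fraction(predicted) - Fraction(reference)) <= tol
    except (ValueError, ZeroDivisionError):
        return False
```

Exact matching like this avoids the false positives and grading drift of model-based judges, which is precisely why deterministic numerical answers are required for every problem in the benchmark.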