Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs

📅 2026-05-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the scarcity of high-quality, large-scale benchmarks for evaluating large language models’ reasoning and abstention capabilities on research-level mathematics. To this end, 64 mathematicians authored the Soohak benchmark from scratch: 439 original research-level problems, including a Challenge subset and a refusal subset designed to test abstention. Evaluation employs a hierarchical framework that assesses both solution correctness against reference answers and the ability to identify ill-posed or invalid questions. Even the strongest frontier model evaluated reaches only 30.4% accuracy on the Challenge subset, and leading open-weight models remain below 15%; on the refusal subset, no model exceeds 50%. These results highlight significant limitations in current models’ capacity for research-level mathematical reasoning.
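To make the two reported metrics concrete, here is a minimal Python sketch of how accuracy on solvable problems and the abstention rate on ill-posed problems could be computed separately. It is an illustration under stated assumptions, not the paper's grading pipeline: the `Item` record, the `ABSTAIN` sentinel, and exact string matching against the reference answer are all hypothetical choices, since the summary does not describe how responses are graded.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sentinel a model is assumed to emit when it declines to answer.
ABSTAIN = "ABSTAIN"


@dataclass
class Item:
    # Reference answer; None marks an ill-posed problem (assumed convention).
    reference: Optional[str]
    # The model's final response for this problem.
    response: str


def is_correct(item: Item) -> bool:
    """Ill-posed items count as correct only when the model abstains;
    solvable items use exact match (an assumption, not the paper's grader)."""
    if item.reference is None:
        return item.response.strip() == ABSTAIN
    return item.response.strip() == item.reference.strip()


def score(items: list) -> dict:
    """Return accuracy on solvable items and abstention rate on ill-posed ones."""
    solvable = [i for i in items if i.reference is not None]
    ill_posed = [i for i in items if i.reference is None]

    def rate(group):
        # None signals an empty group rather than a misleading 0.0.
        return sum(is_correct(i) for i in group) / len(group) if group else None

    return {"accuracy": rate(solvable), "abstention_rate": rate(ill_posed)}


results = score([
    Item(reference="42", response="42"),      # solvable, answered correctly
    Item(reference="7", response="8"),        # solvable, answered incorrectly
    Item(reference=None, response=ABSTAIN),   # ill-posed, model abstains
    Item(reference=None, response="3"),       # ill-posed, model answers anyway
])
```

Reporting the two rates separately keeps a model from inflating its score by abstaining everywhere: abstaining on a solvable item counts as an error under this sketch.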

📝 Abstract

Following the recent achievement of gold-medal performance on the IMO by frontier LLMs, the community is searching for the next meaningful and challenging target for measuring LLM reasoning. Whereas olympiad-style problems measure step-by-step reasoning alone, research-level problems use such reasoning to advance the frontier of mathematical knowledge itself, emerging as a compelling alternative. Yet research-level math benchmarks remain scarce because such problems are difficult to source (e.g., Riemann Bench and FrontierMath-Tier 4 contain 25 and 50 problems, respectively). To support reliable evaluation of next-generation frontier models, we introduce Soohak, a 439-problem benchmark newly authored from scratch by 64 mathematicians. Soohak comprises two subsets. On the Challenge subset, frontier models including Gemini-3-Pro, GPT-5, and Claude-Opus-4.5 reach 30.4%, 26.4%, and 10.4% respectively, leaving substantial headroom, while leading open-weight models such as Qwen3-235B, GPT-OSS-120B, and Kimi-2.5 remain below 15%. Notably, beyond standard problem solving, Soohak introduces a refusal subset that probes a capability intrinsic to research mathematics: recognizing ill-posed problems and pausing rather than producing confident but unjustified answers. On this subset, no model exceeds 50%, identifying refusal as a new optimization target that current models do not directly address. To prevent contamination, the dataset will be publicly released in late 2026, with model evaluations available upon request in the interim.

Problem

Research questions and friction points this paper is trying to address.

research-level mathematics

LLM evaluation

mathematical reasoning

ill-posed problems

benchmark scarcity

Innovation

Methods, ideas, or system contributions that make the work stand out.

research-level mathematics

LLM benchmark

refusal capability