🤖 AI Summary
This work proposes a methodology for evaluating the ability of AI systems to solve research-level mathematical problems. To address a limitation of existing benchmarks, which often fail to reflect authentic research scenarios, the authors construct a benchmark of ten challenging, previously unpublished problems that arose in their own research, each with a solution known to the authors but kept confidential. To preserve evaluation integrity, the answers are released only in encrypted form, preventing data leakage while still allowing later verification. This study represents the first effort to use genuine, previously unpublished research problems as a benchmark for assessing advanced mathematical reasoning in large language models, thereby establishing a more rigorous and realistic standard for measuring performance on high-level mathematical tasks.
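The abstract does not specify how the answers are encrypted, but the integrity goal (publish now, reveal later, leak nothing in between) is what a cryptographic commitment provides. A minimal sketch, assuming a salted hash commitment rather than the authors' actual mechanism:

```python
import hashlib
import secrets


def commit(answer: str) -> tuple[str, str]:
    """Commit to an answer by publishing a salted SHA-256 digest.

    The random salt blocks brute-force guessing of short answers;
    the digest alone reveals nothing about the answer.
    """
    salt = secrets.token_hex(16)
    digest = hashlib.sha256((salt + answer).encode()).hexdigest()
    # Publish `digest` immediately; keep `salt` and `answer` private.
    return digest, salt


def verify(answer: str, salt: str, digest: str) -> bool:
    """Check a later-revealed answer against the published commitment."""
    return hashlib.sha256((salt + answer).encode()).hexdigest() == digest


digest, salt = commit("question 1: 42")
assert verify("question 1: 42", salt, digest)      # honest reveal passes
assert not verify("question 1: 41", salt, digest)  # altered answer fails
```

Once the evaluation window closes, the authors can reveal each answer together with its salt, and anyone can confirm the answers were fixed in advance and not changed after model outputs were seen.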
📝 Abstract
To assess the ability of current AI systems to correctly answer research-level mathematics questions, we share a set of ten math questions that arose naturally in the authors' research. The questions have not been shared publicly before now; the answers are known to the authors of the questions but will remain encrypted for a short time.