🤖 AI Summary
This work proposes a methodology for evaluating the ability of AI systems to solve research-level mathematical problems. To address a limitation of existing benchmarks, which often fail to reflect authentic research scenarios, the authors construct a benchmark of ten challenging, previously unpublished problems that arose in their own research, each with a solution known to the authors but kept confidential. To preserve evaluation integrity, the answers are released only in encrypted form, preventing data leakage while still allowing later verification. This study represents the first effort to use genuine, previously unpublished research problems as a benchmark for assessing advanced mathematical reasoning in large language models, thereby establishing a more rigorous and realistic standard for measuring performance on high-level mathematical tasks.
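The abstract does not specify how the answers are encrypted, but the integrity goal (publish now, reveal later, leak nothing in between) is what a cryptographic commitment provides. A minimal sketch, assuming a salted hash commitment rather than the authors' actual mechanism:

```python
import hashlib
import secrets


def commit(answer: str) -> tuple[str, str]:
    """Commit to an answer by publishing a salted SHA-256 digest.

    The random salt blocks brute-force guessing of short answers;
    the digest alone reveals nothing about the answer.
    """
    salt = secrets.token_hex(16)
    digest = hashlib.sha256((salt + answer).encode()).hexdigest()
    # Publish `digest` immediately; keep `salt` and `answer` private.
    return digest, salt


def verify(answer: str, salt: str, digest: str) -> bool:
    """Check a later-revealed answer against the published commitment."""
    return hashlib.sha256((salt + answer).encode()).hexdigest() == digest


digest, salt = commit("question 1: 42")
assert verify("question 1: 42", salt, digest)      # honest reveal passes
assert not verify("question 1: 41", salt, digest)  # altered answer fails
```

Once the evaluation window closes, the authors can reveal each answer together with its salt, and anyone can confirm the answers were fixed in advance and not changed after model outputs were seen.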
📝 Abstract
To assess the ability of current AI systems to correctly answer research-level mathematics questions, we share a set of ten math questions that arose naturally in the authors' research. The questions have not been shared publicly before now; the answers are known to the authors of the questions but will remain encrypted for a short time.