🤖 AI Summary
This work addresses the gap between recent optimism about large language models' (LLMs) formal mathematical reasoning, fueled by gold-medal-level results on olympiad problems, and their actual proof-writing ability. Method: We introduce Yu Tsumura's 554th problem as a targeted benchmark: it is within the scope of an IMO problem in terms of proof sophistication, is non-combinatorial (avoiding a known LLM weak spot), requires fewer proof techniques than typical hard IMO problems, and has a publicly documented solution that is likely present in LLM training data. Contribution/Results: Systematic evaluation across state-of-the-art commercial and open-source LLMs shows that none produces a correct proof, exposing a deficiency in chaining multi-step logical inferences even when a correct solution is plausibly available in the training corpus. The result serves as a concrete, verifiable, and fully transparent counterexample to claims of genuine IMO-level reasoning. A minimal sketch of this style of evaluation is given below.
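For readers who want to reproduce this style of evaluation, here is a minimal harness sketch. It is an illustration under stated assumptions, not the paper's actual setup: the model identifiers, prompt wording, and `query_model` helper are hypothetical placeholders, and grading of the collected proof attempts against the published solution is assumed to be manual.

```python
# Minimal sketch of a "one problem, many models" evaluation loop.
# Assumptions (not from the paper): MODELS, PROMPT, and query_model()
# are hypothetical placeholders; proof attempts are graded manually.

PROBLEM_STATEMENT = "<full statement of Yu Tsumura's 554th problem>"

PROMPT = (
    "Provide a complete, rigorous proof of the following statement. "
    "Justify every step.\n\n" + PROBLEM_STATEMENT
)

# Hypothetical identifiers for the LLMs under test.
MODELS = ["model-a", "model-b"]


def query_model(model: str, prompt: str) -> str:
    """Hypothetical wrapper around a provider's chat API; returns the
    model's raw text response."""
    raise NotImplementedError("plug in the provider SDK of your choice")


def collect_attempts() -> dict[str, str]:
    # One attempt per model; per-model sampling settings such as
    # temperature or retry counts are not reproduced here.
    return {m: query_model(m, PROMPT) for m in MODELS}


if __name__ == "__main__":
    for model, attempt in collect_attempts().items():
        # Save each attempt for manual grading against the published solution.
        with open(f"attempt_{model}.txt", "w") as f:
            f.write(attempt)
```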
📝 Abstract
We show, contrary to the optimism about LLMs' problem-solving abilities fueled by the recent gold medals that were attained, that a problem exists -- Yu Tsumura's 554th problem -- that a) is within the scope of an IMO problem in terms of proof sophistication, b) is not a combinatorics problem, a category that has caused issues for LLMs, c) requires fewer proof techniques than typical hard IMO problems, d) has a publicly available solution (likely in the training data of LLMs), and e) cannot be readily solved by any existing off-the-shelf LLM (commercial or open-source).