No LLM Solved Yu Tsumura's 554th Problem

📅 2025-08-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of rigorous benchmarks for evaluating large language models' (LLMs) formal mathematical reasoning, particularly in non-combinatorial, proof-based problem solving. Method: the authors single out Yu Tsumura's 554th problem as a benchmark question comparable to International Mathematical Olympiad (IMO) standards: it is non-combinatorial, requires a concise logical derivation using fewer proof techniques than typical hard IMO problems, and has a publicly documented solution that is likely present in LLMs' training data. Contribution/Results: systematic evaluation across state-of-the-art commercial and open-source LLMs reveals that none produces a correct proof, exposing a deficiency in chaining multi-step logical inferences even when the solution is plausibly within the training distribution. This establishes a simple, transparent, and verifiable test of genuine deductive reasoning as opposed to statistical memorization.

📝 Abstract
We show, contrary to the optimism about LLMs' problem-solving abilities fueled by the recently attained gold medals, that a problem exists -- Yu Tsumura's 554th problem -- that a) is within the scope of an IMO problem in terms of proof sophistication, b) is not a combinatorics problem, a class that has caused issues for LLMs, c) requires fewer proof techniques than typical hard IMO problems, d) has a publicly available solution (likely in the training data of LLMs), and e) cannot be readily solved by any existing off-the-shelf LLM (commercial or open-source).
Problem

Research questions and friction points this paper is trying to address.

State-of-the-art LLMs fail to solve Yu Tsumura's 554th problem
The problem matches IMO-level difficulty while requiring fewer proof techniques
A public solution exists, yet LLMs still cannot solve the problem
Innovation

Methods, ideas, or system contributions that make the work stand out.

Identifies a concrete problem unsolved by any tested LLM
Situates the problem's difficulty relative to IMO standards
Evaluates LLMs on a problem whose solution is publicly accessible
Simon Frieder, University of Oxford (machine learning)
William Hart, University of Cambridge