🤖 AI Summary
This work addresses the gap between recent optimism about large language models' (LLMs) formal mathematical reasoning, fueled by gold-medal-level results on olympiad problems, and their actual proof-writing ability. Method: We introduce Yu Tsumura's 554th problem as a targeted benchmark: it is within the scope of an IMO problem in terms of proof sophistication, is non-combinatorial (avoiding a known LLM weak spot), requires fewer proof techniques than typical hard IMO problems, and has a publicly documented solution that is likely present in LLM training data. Contribution/Results: Systematic evaluation across state-of-the-art commercial and open-source LLMs shows that none produces a correct proof, exposing a deficiency in chaining multi-step logical inferences even when a correct solution is plausibly available in the training corpus. The result serves as a concrete, verifiable, and fully transparent counterexample to claims of genuine IMO-level reasoning. A minimal sketch of this style of evaluation is given below.
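For readers who want to reproduce this style of evaluation, here is a minimal harness sketch. It is an illustration under stated assumptions, not the paper's actual setup: the model identifiers, prompt wording, and `query_model` helper are hypothetical placeholders, and grading of the collected proof attempts against the published solution is assumed to be manual.

```python
# Minimal sketch of a "one problem, many models" evaluation loop.
# Assumptions (not from the paper): MODELS, PROMPT, and query_model()
# are hypothetical placeholders; proof attempts are graded manually.

PROBLEM_STATEMENT = "<full statement of Yu Tsumura's 554th problem>"

PROMPT = (
    "Provide a complete, rigorous proof of the following statement. "
    "Justify every step.\n\n" + PROBLEM_STATEMENT
)

# Hypothetical identifiers for the LLMs under test.
MODELS = ["model-a", "model-b"]


def query_model(model: str, prompt: str) -> str:
    """Hypothetical wrapper around a provider's chat API; returns the
    model's raw text response."""
    raise NotImplementedError("plug in the provider SDK of your choice")


def collect_attempts() -> dict[str, str]:
    # One attempt per model; per-model sampling settings such as
    # temperature or retry counts are not reproduced here.
    return {m: query_model(m, PROMPT) for m in MODELS}


if __name__ == "__main__":
    for model, attempt in collect_attempts().items():
        # Save each attempt for manual grading against the published solution.
        with open(f"attempt_{model}.txt", "w") as f:
            f.write(attempt)
```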
📝 Abstract
We show, contrary to the optimism about LLMs' problem-solving abilities fueled by the recent gold medals that were attained, that a problem exists -- Yu Tsumura's 554th problem -- that a) is within the scope of an IMO problem in terms of proof sophistication, b) is not a combinatorics problem, a category that has caused issues for LLMs, c) requires fewer proof techniques than typical hard IMO problems, d) has a publicly available solution (likely in the training data of LLMs), and e) cannot be readily solved by any existing off-the-shelf LLM (commercial or open-source).