Escaping the Cognitive Well: Efficient Competition Math with Off-the-Shelf Models

📅 2026-02-18
🤖 AI Summary
This work addresses the limitations of current mathematical reasoning models on International Mathematical Olympiad (IMO)-level problems, where high inference costs and entrapment in incorrect solution paths often lead to poor performance. The authors propose an efficient solving framework built upon general-purpose large language models (e.g., Gemini 3.0 Pro), incorporating a novel mechanism that extracts conjectures and validates them in isolated contexts. This approach effectively identifies and circumvents “cognitive traps” in the solver–evaluator pipeline, preventing the misjudgment of erroneous solutions during iterative refinement. Evaluated on the IMO-ProofBench Advanced benchmark, the method achieves a success rate of 67.1% at an average cost of approximately \$31 per problem, substantially outperforming both published and unpublished baselines—more than doubling the success rate of the next-best public method.

📝 Abstract
In the past year, custom and unreleased math reasoning models reached gold medal performance on the International Mathematical Olympiad (IMO). Similar performance was then reported using large-scale inference on publicly available models, but at prohibitive costs (e.g., 3000 USD per problem). In this work, we present an inference pipeline that attains best-in-class performance on IMO-style math problems at an average inference cost orders of magnitude below competing methods, while using only general-purpose off-the-shelf models. Our method relies on insights about a grader failure mode in solver-grader pipelines, which we call the Cognitive Well: iterative refinement converging to a wrong solution that both the solver and the pipeline's internal grader consider essentially correct. Our pipeline addresses these failure modes through conjecture extraction, wherein candidate lemmas are isolated from generated solutions and independently verified alongside their negations in a fresh environment (context detachment). On IMO-ProofBench Advanced (PB-Adv), our pipeline achieves 67.1 percent performance using Gemini 3.0 Pro with an average cost per question of approximately 31 USD. At the time of evaluation, this represented the state-of-the-art on PB-Adv among both public and unreleased models, and more than doubles the success rate of the next best publicly accessible pipeline, all at a fraction of the cost.
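The conjecture-extraction and context-detachment loop described in the abstract can be sketched roughly as follows. This is not the authors' implementation: the function names (`ask_model`, `extract_conjectures`, `verify_detached`) are hypothetical, and the model call is stubbed with a toy rule so the sketch runs standalone; in practice `ask_model` would wrap a real off-the-shelf LLM API such as Gemini 3.0 Pro.

```python
def ask_model(prompt: str) -> str:
    """Stub standing in for an off-the-shelf LLM call (hypothetical).
    Toy rule: statements mentioning 'even' are judged true, and the
    negation check inverts that judgment."""
    negated = prompt.startswith("Negation check:")
    looks_true = "even" in prompt.lower()
    if negated:
        return "REJECT" if looks_true else "ACCEPT"
    return "ACCEPT" if looks_true else "REJECT"


def extract_conjectures(solution: str) -> list[str]:
    """Naive stand-in for conjecture extraction: pull out sentences
    that read like candidate lemmas (here, anything starting with
    'Claim'). The paper presumably uses the LLM itself for this step."""
    return [s.strip() for s in solution.split(".")
            if s.strip().lower().startswith("claim")]


def verify_detached(conjecture: str) -> bool:
    """Context detachment: check the conjecture AND its negation in two
    fresh, independent prompts, with no access to the original solution.
    Accept only when the direct check passes and the negation check fails,
    which guards against a grader that rubber-stamps everything."""
    pro = ask_model(f"Direct check: is this statement true? {conjecture}")
    con = ask_model(f"Negation check: is the negation true? {conjecture}")
    return pro == "ACCEPT" and con == "REJECT"


# A toy "solution" containing one sound and one unsound candidate lemma.
solution = ("Claim: n(n+1) is always even. "
            "Claim: n(n+1) is always prime. Hence the result follows.")
verdicts = {c: verify_detached(c) for c in extract_conjectures(solution)}
```

Checking both the statement and its negation in detached contexts is what lets the pipeline catch the Cognitive Well case: a lemma the solver and grader both accept inside the solution's context can still fail when re-examined in isolation.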
Problem

Research questions and friction points this paper is trying to address.

Cognitive Well
IMO-style math problems
off-the-shelf models
solver-grader pipelines
inference cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cognitive Well
conjecture extraction
context detachment
solver-grader pipeline
off-the-shelf models
Authors

Xingyu Dang
Princeton University, New Jersey, USA

Rohit Agarwal
Princeton University, New Jersey, USA

Rodrigo Porto
Princeton University, New Jersey, USA

Anirudh Goyal
Mila, Université de Montréal
Machine Learning, Deep Learning, Deep Reinforcement Learning

Liam H Fowl
Princeton Language and Intelligence

Sanjeev Arora
Professor of Computer Science, Princeton University
theoretical machine learning, theoretical computer science