Judging What We Cannot Solve: A Consequence-Based Approach for Oracle-Free Evaluation of Research-Level Math

📅 2026-02-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses a longstanding bottleneck in evaluating research-level mathematical solutions: verification has traditionally relied on expert annotation for ground truth. The authors propose an oracle-free evaluation framework that scores a candidate solution by using it as an in-context exemplar and measuring how well it generalizes to semantically proximate, automatically verifiable problems, yielding a consequence-driven utility score. The resulting evaluator is unsupervised, requiring neither ground-truth answers nor human annotations. Evaluated on a set of research-level math problems, the framework substantially outperforms existing reward models and LLM-as-a-judge baselines, reaching 76.3 Acc@1 and 79.6 AUC with GPT-OSS-120B.
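To make the scoring idea concrete, here is a minimal sketch of consequence-based utility under stated assumptions: a generic `ask_llm` solver call and a small set of automatically verifiable probe questions. All names are hypothetical placeholders, not the authors' implementation.

```python
# Hedged sketch of consequence-based utility scoring.
# `ask_llm`, `Probe`, and the prompt format are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Probe:
    question: str                   # related, automatically verifiable question
    check: Callable[[str], bool]    # verifier for the probe's final answer


def consequence_utility(candidate_solution: str,
                        research_problem: str,
                        probes: List[Probe],
                        ask_llm: Callable[[str], str]) -> float:
    """Score a candidate by how much it helps, as an in-context exemplar,
    in solving nearby verifiable probes (higher = more useful consequences)."""
    solved = 0
    for probe in probes:
        prompt = (
            f"Worked example:\nProblem: {research_problem}\n"
            f"Solution: {candidate_solution}\n\n"
            f"Using the method above, solve:\n{probe.question}\n"
        )
        answer = ask_llm(prompt)       # one solver call per probe
        solved += probe.check(answer)  # automatic verification of the answer
    return solved / len(probes)
```

Candidates for the same research problem would then be ranked by this score, with the top-ranked one taken as the predicted best solution.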

📝 Abstract
Recent progress in reasoning models suggests that generating plausible attempts at research-level mathematics may be within reach, but verification remains a bottleneck that consumes scarce expert time. We hypothesize that a meaningful solution contains enough method-level information that, when applied to a neighborhood of related questions, it yields better downstream performance than incorrect solutions. Building on this idea, we propose Consequence-Based Utility, an oracle-free evaluator that scores each candidate by testing its value as an in-context exemplar for solving related yet verifiable questions. Our approach is evaluated on an original set of research-level math problems, each paired with one expert-written solution and nine LLM-generated solutions. Notably, Consequence-Based Utility consistently outperforms reward models, generative reward models, and LLM judges on ranking quality. Specifically, for GPT-OSS-120B, it improves Acc@1 from 67.2 to 76.3 and AUC from 71.4 to 79.6, with similarly large AUC gains on GPT-OSS-20B (69.0 to 79.2). Furthermore, compared to LLM judges, it also exhibits a larger solver-evaluator gap, maintaining stronger correct-wrong separation even on instances that the underlying solver itself often fails to solve.
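As a rough illustration of how the reported ranking metrics could be derived from such utility scores, the sketch below follows the standard definitions of Acc@1 (is the top-scored candidate for each problem correct?) and pooled ROC-AUC via scikit-learn's roc_auc_score. It is an assumed reconstruction, not necessarily the paper's exact evaluation protocol.

```python
# Hedged sketch: Acc@1 and pooled AUC from (utility_score, is_correct) pairs.
# The per-problem data layout is an assumption for illustration only.
from typing import Dict, List, Tuple

from sklearn.metrics import roc_auc_score


def acc_at_1(problems: Dict[str, List[Tuple[float, bool]]]) -> float:
    """Fraction of problems whose highest-scoring candidate is correct.
    Each problem maps to a list of (utility_score, is_correct) pairs."""
    hits = sum(max(cands, key=lambda sc: sc[0])[1] for cands in problems.values())
    return hits / len(problems)


def pooled_auc(problems: Dict[str, List[Tuple[float, bool]]]) -> float:
    """ROC-AUC of utility scores against correctness labels, pooled over all
    candidates (requires at least one correct and one incorrect candidate)."""
    scores = [s for cands in problems.values() for s, _ in cands]
    labels = [int(c) for cands in problems.values() for _, c in cands]
    return roc_auc_score(labels, scores)
```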
Problem

Research questions and friction points this paper is trying to address.

oracle-free evaluation
research-level mathematics
solution quality assessment
mathematical reasoning
verification bottleneck
Innovation

Methods, ideas, or system contributions that make the work stand out.

Consequence-Based Utility
oracle-free evaluation
research-level mathematics
in-context exemplar
downstream performance