Re$^2$Math: Benchmarking Theorem Retrieval in Research-Level Mathematics

📅 2026-05-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current large language models struggle to retrieve existing theorems that apply to specific gaps in research-level mathematical proofs and to verify that applicability. This work proposes Re$^2$Math, the first theorem retrieval benchmark tailored to research-level mathematics, which builds hierarchical contexts and controllable prompts from the actual proof structures of published papers. The benchmark decouples literature tool use into three separately diagnosable dimensions: citation recall, source grounding, and proof-gap sufficiency. It also introduces a leakage-controlled anchor hint and a citation-agnostic matching criterion that accepts any admissible theorem sufficient for the proof step, rather than only the originally cited one. On the current test set, the best fixed-judge ToolAcc is only 7.0%, even though source-grounding rates are substantially higher, which points to a bottleneck in judging theorem applicability rather than in finding valid statements.

📝 Abstract

Large language models are increasingly capable of closed-world mathematical reasoning, but research assistance also requires source-grounded use of the literature. When a proof reaches a non-trivial step, a useful assistant should determine whether the needed tool (e.g., a lemma) already exists, identify a suitable scholarly source, and verify that its assumptions align with the current proof context. To rigorously evaluate such capabilities, we introduce Re$^2$Math, a benchmark for tool-grounded retrieval from partial mathematical proofs. Each instance is built from a candidate instrumental citation in the proof of a main theorem, with hierarchical context and an optional leakage-controlled anchor hint. We also make the task source-grounded yet citation-agnostic in that any admissible theorem sufficient for the proof transition is accepted. Evaluation uses a release-frozen retrieval artifact, ensuring reproducibility, while the benchmark itself supports automatic, continual expansion with newly constructed instances. On the current benchmark test set, the best fixed-judge ToolAcc reaches 7.0%, despite substantially higher rates of source grounding, indicating that current systems often retrieve valid statements but fail to establish their applicability to the local proof step. By decoupling citation recall, grounding, and proof-gap sufficiency, Re$^2$Math transforms literature-grounded mathematical tool use into a controlled diagnostic task.
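
To make the decoupling concrete, the following is a minimal sketch of how the three diagnostic rates could be reported separately from per-instance judge verdicts. It is not the paper's evaluation code: the `JudgedInstance` record, its field names, the `decoupled_rates` helper, and the rule that a tool counts as correct only when it is both grounded and sufficient are all assumptions made for illustration, and the data is a toy example rather than a benchmark result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgedInstance:
    """One instance after judging. All field names are hypothetical."""
    cited_source_retrieved: bool  # the originally cited source was found
    grounded: bool                # the returned statement exists in a real source
    sufficient: bool              # the statement closes the local proof gap


def decoupled_rates(results: list[JudgedInstance]) -> dict[str, float]:
    """Report citation recall, grounding, and tool accuracy as separate rates."""
    n = len(results)
    if n == 0:
        raise ValueError("no judged instances")
    return {
        "citation_recall": sum(r.cited_source_retrieved for r in results) / n,
        "grounding_rate": sum(r.grounded for r in results) / n,
        # Assumption: correct means grounded AND sufficient. The cited source
        # is not required, which mirrors the citation-agnostic criterion.
        "tool_acc": sum(r.grounded and r.sufficient for r in results) / n,
    }


# Toy verdicts, not results from the paper.
toy = [
    JudgedInstance(cited_source_retrieved=True, grounded=True, sufficient=True),
    JudgedInstance(cited_source_retrieved=False, grounded=True, sufficient=True),
    JudgedInstance(cited_source_retrieved=False, grounded=True, sufficient=False),
    JudgedInstance(cited_source_retrieved=False, grounded=False, sufficient=False),
]
rates = decoupled_rates(toy)
print(rates)
```

In this toy set the grounding rate exceeds the tool accuracy because one retrieved statement is real but does not close the proof gap, which is the failure pattern the abstract describes. The second instance counts as correct without matching the original citation.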

Problem

Research questions and friction points this paper is trying to address.

theorem retrieval

mathematical reasoning

research-level mathematics

proof assistance

source grounding

Innovation

Methods, ideas, or system contributions that make the work stand out.

theorem retrieval

mathematical reasoning

source-grounded benchmark