Re$^2$Math: Benchmarking Theorem Retrieval in Research-Level Mathematics

2026-05-09
Citations: 0
Influential: 0

AI Summary
Current large language models struggle to retrieve and verify existing theorems that apply to specific gaps in research-level mathematical proofs. This work proposes Re²Math, the first theorem retrieval benchmark tailored to research-level mathematics, which builds hierarchical contexts and controllable prompts from the actual proof structures of published papers. The benchmark decouples literature tool use into three diagnosable dimensions: citation recall, source anchoring, and proof-gap alignment. It also introduces a leakage-controlled anchor prompting mechanism and a matching strategy that prioritizes semantic adequacy over citation proximity. On this benchmark, the best system reaches a fixed-judge ToolAcc of only 7.0%, even though source grounding rates are substantially higher. This points to a bottleneck in models' ability to judge whether a retrieved theorem applies to the proof step at hand.
๐Ÿ“ Abstract
Large language models are increasingly capable at closed-world mathematical reasoning, but research assistance also requires source-grounded use of the literature. When a proof reaches a non-trivial step, a useful assistant should determine whether the needed tool (e.g., a lemma) already exists, identify a suitable scholarly source, and verify that its assumptions align with the current proof context. To rigorously evaluate such capabilities, we introduce Re$^2$Math, a benchmark for tool-grounded retrieval from partial mathematical proofs. Each instance is built from a candidate instrumental citation in the proof of a main theorem, with hierarchical context and an optional leakage-controlled anchor hint. We also make the task source-grounded yet citation-agnostic in that any admissible theorem sufficient for the proof transition is accepted. Evaluation uses a release-frozen retrieval artifact, ensuring reproducibility, while the benchmark itself supports automatic, continual expansion with newly constructed instances. On the current benchmark test set, the best fixed-judge ToolAcc reaches 7.0%, despite substantially higher rates of source grounding, indicating that current systems often retrieve valid statements but fail to establish their applicability to the local proof step. By decoupling citation recall, grounding, and proof-gap sufficiency, Re$^2$Math transforms literature-grounded mathematical tool use into a controlled diagnostic task.
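The decoupling described above can be illustrated with a minimal sketch. It assumes each benchmark instance yields three boolean verdicts and that a tool-use success means "grounded and sufficient for the proof step", regardless of whether the originally cited source was matched. The `Judgment` fields, the `rates` helper, and this reading of ToolAcc are hypothetical illustrations, not the paper's actual definitions or code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgment:
    """Hypothetical per-instance verdicts; field names are illustrative only."""
    cited_source_retrieved: bool  # the originally cited source was retrieved
    source_grounded: bool         # the returned statement is tied to a real source
    gap_sufficient: bool          # the statement suffices for the local proof step


def rates(judgments):
    """Aggregate the three dimensions separately so failures can be diagnosed."""
    n = len(judgments)
    if n == 0:
        raise ValueError("need at least one judgment")
    return {
        "citation_recall": sum(j.cited_source_retrieved for j in judgments) / n,
        "grounding": sum(j.source_grounded for j in judgments) / n,
        # Assumption: citation-agnostic success = grounded AND sufficient.
        "tool_acc": sum(j.source_grounded and j.gap_sufficient for j in judgments) / n,
    }


# Toy data (made up): most answers are grounded, few are actually applicable.
toy = [
    Judgment(True, True, False),
    Judgment(False, True, True),
    Judgment(False, True, False),
    Judgment(False, False, False),
]
result = rates(toy)
print(result)
```

On this toy data, grounding is well above the assumed tool accuracy, which mirrors the qualitative gap the abstract reports: valid statements are retrieved, but their applicability to the proof step is not established.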
Problem

Research questions and friction points this paper is trying to address.

theorem retrieval
mathematical reasoning
research-level mathematics
proof assistance
source grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

theorem retrieval
mathematical reasoning
source-grounded benchmark
proof gap
citation-agnostic evaluation
Similar Papers
2024-03-20 · Conference on Empirical Methods in Natural Language Processing · Citations: 1