Risk-Controlled Lean-as-Judge for Natural-Language Mathematical Reasoning

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Lean proofs give only a partial signal about the correctness of natural-language mathematical answers: many answers never formalize, and a failed proof may reflect a formalization problem rather than a wrong answer. This work proposes COVCAL, a selector over Lean-trace diagnostics that either certifies a finite-sample selective-risk bound on the answers it accepts or abstains. It is evaluated under two regimes, a conservative Bonferroni bound and a tighter dev-then-cal rule, and whether it can certify anything depends on autoformalization coverage. With a 7B formalizer the signal is too sparse and the Bonferroni regime abstains on all 20 bootstrap partitions. A prover-specialized formalizer reaches 79% coverage, makes the Bonferroni regime feasible on 17 of 20 partitions, and accepts approximately 48% of problems at 0.98 accepted accuracy.

📝 Abstract

Lean is increasingly used to judge natural-language mathematical answers, but its signal is partial: many answers never formalize, and a failed proof may reflect an ill-typed statement or a missing library fact, not a wrong answer. On MATH-500 we show this signal is (i) sharply coverage-dependent: the proof-winning answer is correct 96% of the time at high proved coverage but only 20% at low coverage, and (ii) sparse and often unfaithful: a 7B autoformalizer proves a class for only 28% of problems, and a manual audit finds only approximately 43% of those proofs faithful. We propose COVCAL, a selector over Lean-trace diagnostics that certifies a finite-sample selective-risk bound on accepted answers or abstains, under two regimes (a conservative Bonferroni bound and a tighter dev-then-cal rule). Feasibility depends on autoformalization coverage: with the 7B formalizer the signal is too sparse and Bonferroni abstains on all 20 bootstrap partitions, whereas a prover-specialized formalizer reaches 79% coverage and makes Bonferroni feasible on 17 of 20, accepting approximately 48% of problems at 0.98 accepted accuracy. Since self-consistency alone is already 91% accurate, our contribution is a precise account of when, and with which formalizer, a partial formal signal can be trusted under risk control.
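To make the "certify or abstain" idea concrete, here is a minimal Python sketch of one standard way to obtain a finite-sample selective-risk bound with a Bonferroni correction. It is not COVCAL itself: the abstract does not specify the paper's scores, thresholds, or bound. The function names, the use of an exact binomial (Clopper-Pearson) upper bound, and the scalar `scores` input standing in for Lean-trace diagnostics are all assumptions made for illustration.

```python
import math

def binom_cdf(k, n, p):
    """P[X <= k] for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def clopper_pearson_upper(errors, n, delta):
    """Exact one-sided upper confidence bound on an error rate.

    Returns the p at which P[X <= errors] drops to delta, found by bisection.
    With no accepted samples the bound is vacuous (1.0).
    """
    if n == 0 or errors >= n:
        return 1.0
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if binom_cdf(errors, n, mid) > delta:
            lo = mid
        else:
            hi = mid
    return hi

def select_threshold(scores, correct, thresholds, target_risk, delta):
    """Pick the most permissive certified threshold, or None to abstain.

    scores:  hypothetical per-problem confidence on a calibration set
    correct: whether the selected answer was actually right
    """
    # Bonferroni: split the failure budget across all candidate thresholds.
    per_test_delta = delta / len(thresholds)
    certified = []
    for t in thresholds:
        accepted = [c for s, c in zip(scores, correct) if s >= t]
        errors = sum(1 for c in accepted if not c)
        bound = clopper_pearson_upper(errors, len(accepted), per_test_delta)
        if bound <= target_risk:
            certified.append(t)
    return min(certified) if certified else None
```

The sketch also shows why sparse coverage forces abstention: with only 10 accepted calibration answers and zero errors, the 95% upper bound on the error rate is still about 0.26, so no reasonable risk target can be certified, while 150 error-free accepted answers give a bound of about 0.02.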

Problem

Research questions and friction points this paper is trying to address.

Lean-as-Judge

natural-language mathematical reasoning

autoformalization

risk control

proof faithfulness

Innovation

Methods, ideas, or system contributions that make the work stand out.

risk-controlled selection

autoformalization

Lean-as-Judge