🤖 AI Summary
This work addresses a common failure mode of large language models in mathematical reasoning: applying a lemma's conclusion without verifying its premises. The authors frame lemma usage as a structured prediction task and propose a two-stage output architecture—first validating premises and then assessing conclusion applicability—augmented with a segment-aware reinforcement learning mechanism. By integrating loss masking and joint training on both natural language and formal proofs, the approach pinpoints which stage of reasoning is responsible for an error. Evaluated on in-domain tasks and premise-perturbed scenarios, the method significantly outperforms baseline models, while achieving comparable or slightly improved performance in end-to-end mathematical reasoning. Ablation studies confirm the necessity of each proposed component.
📝 Abstract
Recent large language models (LLMs) perform strongly on mathematical benchmarks yet often misapply lemmas, importing conclusions without validating assumptions. We formalize lemma-judging as a structured prediction task: given a statement and a candidate lemma, the model must output a precondition check and a conclusion-utility check, from which a usefulness decision is derived. We present RULES, which encodes this specification via a two-section output and trains with reinforcement learning plus section-aware loss masking to assign penalty to the section responsible for errors. Training and evaluation draw on diverse natural language and formal proof corpora; robustness is assessed with a held-out perturbation suite; and end-to-end evaluation spans competition-style, perturbation-aligned, and theorem-based problems across various LLMs. Results show consistent in-domain gains over both a vanilla model and a single-label RL baseline, larger improvements on applicability-breaking perturbations, and parity or modest gains on end-to-end tasks; ablations indicate that the two-section outputs and section-aware reinforcement are both necessary for robustness.
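The two-section specification and the section-aware penalty can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the section names (`premise_check`, `conclusion_utility`), the yes/no output format, and the ±1 reward values are all assumptions chosen for clarity.

```python
# Hypothetical sketch of two-section lemma judging with
# section-aware credit assignment. Names and reward values
# are illustrative assumptions, not the paper's specification.

def parse_judgement(output: str) -> dict:
    """Parse a two-section model output ('name: yes/no' per line)
    into a dict of boolean checks."""
    sections = {}
    for line in output.strip().splitlines():
        key, _, value = line.partition(":")
        sections[key.strip().lower()] = value.strip().lower() == "yes"
    return sections

def derive_usefulness(premise_ok: bool, conclusion_useful: bool) -> bool:
    # A lemma is judged usable only if its preconditions hold
    # AND its conclusion helps the target statement.
    return premise_ok and conclusion_useful

def section_rewards(pred: dict, gold: dict) -> dict:
    # Section-aware credit assignment: penalize only the section(s)
    # whose prediction is wrong, rewarding correct sections.
    return {k: (1.0 if pred.get(k) == gold[k] else -1.0) for k in gold}

# Example: premises validated, but conclusion judged unhelpful.
output = "premise_check: yes\nconclusion_utility: no"
pred = parse_judgement(output)
gold = {"premise_check": True, "conclusion_utility": True}
print(derive_usefulness(pred["premise_check"], pred["conclusion_utility"]))
print(section_rewards(pred, gold))
```

The point of `section_rewards` is that a wrong final decision does not penalize both sections uniformly: only the section that actually erred receives the negative signal, which is the intuition behind the section-aware loss masking described above.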