🤖 AI Summary
To address the noise and spurious correlations in self-generated data for mathematical reasoning with large language models (LLMs), which stem from a scarcity of high-quality queries, this paper proposes MARGE, a hit-guided exploration method. MARGE explores intermediate reasoning states derived from self-generated solutions and uses sparse feedback (whether an intermediate result hits a correct sub-goal) to steer search trajectories and improve credit assignment throughout the reasoning process. It requires no external annotations or auxiliary value models, and it mitigates the common trade-off between accuracy and exploration diversity in alignment methods. Evaluated across multiple backbone models and benchmarks, MARGE significantly improves both single-shot accuracy and the diversity of effective reasoning paths, demonstrating the feasibility and effectiveness of scaling up low-noise, self-generated training data.
📝 Abstract
Large Language Models (LLMs) exhibit strong potential in mathematical reasoning, yet their effectiveness is often limited by a shortage of high-quality queries. This limitation necessitates scaling up responses through self-generated data, yet current methods struggle with spuriously correlated data caused by ineffective exploration across all reasoning stages. To address this challenge, we introduce **MARGE**: Improving **Ma**th **R**easoning with **G**uided **E**xploration, a novel method that enhances mathematical reasoning through hit-guided exploration. MARGE systematically explores intermediate reasoning states derived from self-generated solutions, enabling adequate exploration and improved credit assignment throughout the reasoning process. Through extensive experiments across multiple backbone models and benchmarks, we demonstrate that MARGE significantly improves reasoning capabilities without requiring external annotations or training additional value models. Notably, MARGE improves both single-shot accuracy and exploration diversity, mitigating a common trade-off in alignment methods. These results demonstrate MARGE's effectiveness in enhancing mathematical reasoning and unlocking the potential of scaling self-generated training data. Our code and models are available at [this link](https://github.com/georgao35/MARGE).
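The abstract's hit-guided idea can be illustrated with a toy sketch: roll out continuations from intermediate states, count how often they "hit" the reference answer, and use the hit rate as a credit signal, with no value model or external labels. Everything below (the function names, the simulated solver, the 32-rollout default) is an illustrative assumption, not the paper's actual implementation:

```python
import random

def sample_continuation(state, rng):
    """Stand-in for LLM decoding that completes a partial solution.

    In the real method the model generates the remaining reasoning
    steps; here we simulate a stochastic solver whose chance of
    reaching the correct answer grows with the number of steps
    already present in `state` (a pure toy assumption).
    """
    p_correct = 0.2 + 0.15 * len(state)
    return "correct" if rng.random() < min(p_correct, 0.95) else "wrong"

def hit_guided_credit(states, reference, num_rollouts=32, seed=0):
    """Estimate a per-state credit signal from rollout hit rates.

    A rollout "hits" when its final answer matches `reference`.
    States with higher hit rates are more promising restart points
    for further exploration.
    """
    rng = random.Random(seed)
    credit = {}
    for name, state in states.items():
        hits = sum(
            sample_continuation(state, rng) == reference
            for _ in range(num_rollouts)
        )
        credit[name] = hits / num_rollouts
    return credit

# Two intermediate states of one self-generated solution.
states = {
    "after_step_1": ["step1"],
    "after_step_3": ["step1", "step2", "step3"],
}
credit = hit_guided_credit(states, reference="correct")
print(credit)
```

Under this toy model, deeper correct prefixes earn higher hit rates, so credit concentrates on the states closest to a correct sub-goal, which is the intuition behind steering exploration with hits rather than a learned value function.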