🤖 AI Summary
This work challenges the prevailing assumption that larger sampling budgets inherently improve preference optimization for mathematical reasoning. Instead, it demonstrates that excessive sampling (e.g., Best-of-N with large N) can degrade results through inefficient exploration, policy collapse, and amplification of verifier noise. The authors propose PACE, a novel approach that leverages failure cases to generate high-fidelity synthetic preference pairs under an extremely low sampling budget (2 < N < 3). Integrated within a proximal alignment framework, PACE outperforms DPO-R1 (N=16) using only about one-fifth of its compute, while exhibiting greater robustness to reward hacking and label noise.
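The verifier-noise claim can be illustrated with a toy simulation (this is a sketch of the general Best-of-N mining setup, not the paper's implementation): if a verifier occasionally accepts incorrect trajectories, then drawing more samples per prompt raises the chance that the mined "chosen" trajectory is a false positive. The model parameters (`p_correct`, `fp_rate`) are assumed values for illustration.

```python
import random

def best_of_n_mining(n, p_correct=0.2, fp_rate=0.05, trials=10_000, seed=0):
    """Estimate how often Best-of-N mining yields a contaminated 'chosen'
    trajectory: one the noisy verifier accepts but that is actually wrong.

    Toy model (assumed, not from the paper): each sampled trajectory is
    truly correct with probability p_correct; the verifier wrongly accepts
    an incorrect trajectory with probability fp_rate.
    """
    rng = random.Random(seed)
    contaminated = 0
    for _ in range(trials):
        accepted = []  # True/False = actually correct, among verifier-accepted
        for _ in range(n):
            correct = rng.random() < p_correct
            verifier_accepts = correct or rng.random() < fp_rate
            if verifier_accepts:
                accepted.append(correct)
        # The mined 'chosen' sample is the first verifier-accepted trajectory;
        # the pair is contaminated if that trajectory is actually incorrect.
        if accepted and not accepted[0]:
            contaminated += 1
    return contaminated / trials
```

Under this toy model the contamination rate rises with N, since larger budgets make it ever more likely that *some* accepted trajectory exists, while the share of false acceptances among accepted samples stays fixed.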
📝 Abstract
Iterative Direct Preference Optimization has emerged as the state-of-the-art paradigm for aligning Large Language Models on reasoning tasks. Standard implementations (DPO-R1) rely on Best-of-N sampling (e.g., $N \ge 8$) to mine golden trajectories from the distribution tail. In this paper, we challenge this scaling hypothesis and reveal a counter-intuitive phenomenon: in mathematical reasoning, aggressive exploration yields diminishing returns and even catastrophic policy collapse. We theoretically demonstrate that scaling $N$ amplifies verifier noise and induces detrimental distribution shifts. To resolve this, we introduce **PACE** (Proximal Alignment via Corrective Exploration), which replaces brute-force mining with a generation-based corrective strategy. Operating with a minimal budget ($2<N<3$), PACE synthesizes high-fidelity preference pairs from failed explorations. Empirical evaluations show that PACE outperforms DPO-R1 ($N=16$) while using only about $1/5$ of the compute, demonstrating superior robustness against reward hacking and label noise.
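The generation-based corrective strategy can be sketched as follows. This is a minimal, hypothetical reading of "synthesizes preference pairs from failed explorations": sample once, and on failure produce a corrected trajectory conditioned on the failed one rather than resampling many times. The callables `generate`, `correct`, and `verify` are stand-ins, not the paper's API; the abstract does not specify PACE's exact mechanism.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # corrected trajectory
    rejected: str  # failed exploration

def corrective_pair(
    prompt: str,
    generate: Callable[[str], str],        # hypothetical: draw one trajectory
    correct: Callable[[str, str], str],    # hypothetical: rewrite a failed one
    verify: Callable[[str, str], bool],    # hypothetical: outcome verifier
) -> Optional[PreferencePair]:
    """Sketch of a corrective strategy under a minimal sampling budget:
    one exploratory sample plus, on failure, one corrective sample
    (roughly two samples per prompt, consistent with a 2 < N < 3 budget)."""
    draft = generate(prompt)               # sample 1: exploration
    if verify(prompt, draft):
        return None                        # already correct; nothing to fix
    fix = correct(prompt, draft)           # sample 2: conditioned correction
    if verify(prompt, fix):
        # Pair the verified correction against the original failure.
        return PreferencePair(prompt, chosen=fix, rejected=draft)
    return None                            # correction also failed; discard
```

For example, with a failed draft "5" for the prompt "2+2?" and a corrector that returns "4", the function pairs "4" (chosen) against "5" (rejected); prompts the policy already solves contribute no pair.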