🤖 AI Summary
This work challenges the prevailing assumption that larger sampling budgets inherently improve preference optimization for mathematical reasoning. Instead, it demonstrates that excessive sampling (e.g., Best-of-N with large N) can degrade results through inefficient exploration, policy collapse, and amplification of verifier noise. The authors propose PACE, a novel approach that leverages failure cases to generate high-fidelity synthetic preference pairs under an extremely low sampling budget (2 < N < 3). Integrated within a proximal alignment framework, PACE outperforms DPO-R1 (N=16) using only about one-fifth of its compute, while exhibiting greater robustness to reward hacking and label noise.
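The verifier-noise claim can be illustrated with a toy simulation (this is a sketch of the general Best-of-N mining setup, not the paper's implementation): if a verifier occasionally accepts incorrect trajectories, then drawing more samples per prompt raises the chance that the mined "chosen" trajectory is a false positive. The model parameters (`p_correct`, `fp_rate`) are assumed values for illustration.

```python
import random

def best_of_n_mining(n, p_correct=0.2, fp_rate=0.05, trials=10_000, seed=0):
    """Estimate how often Best-of-N mining yields a contaminated 'chosen'
    trajectory: one the noisy verifier accepts but that is actually wrong.

    Toy model (assumed, not from the paper): each sampled trajectory is
    truly correct with probability p_correct; the verifier wrongly accepts
    an incorrect trajectory with probability fp_rate.
    """
    rng = random.Random(seed)
    contaminated = 0
    for _ in range(trials):
        accepted = []  # True/False = actually correct, among verifier-accepted
        for _ in range(n):
            correct = rng.random() < p_correct
            verifier_accepts = correct or rng.random() < fp_rate
            if verifier_accepts:
                accepted.append(correct)
        # The mined 'chosen' sample is the first verifier-accepted trajectory;
        # the pair is contaminated if that trajectory is actually incorrect.
        if accepted and not accepted[0]:
            contaminated += 1
    return contaminated / trials
```

Under this toy model the contamination rate rises with N, since larger budgets make it ever more likely that *some* accepted trajectory exists, while the share of false acceptances among accepted samples stays fixed.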
📝 Abstract
Iterative Direct Preference Optimization has emerged as the state-of-the-art paradigm for aligning Large Language Models on reasoning tasks. Standard implementations (DPO-R1) rely on Best-of-N sampling (e.g., $N \ge 8$) to mine golden trajectories from the distribution tail. In this paper, we challenge this scaling hypothesis and reveal a counter-intuitive phenomenon: in mathematical reasoning, aggressive exploration yields diminishing returns and even catastrophic policy collapse. We theoretically demonstrate that scaling $N$ amplifies verifier noise and induces detrimental distribution shifts. To resolve this, we introduce **PACE** (Proximal Alignment via Corrective Exploration), which replaces brute-force mining with a generation-based corrective strategy. Operating with a minimal budget ($2<N<3$), PACE synthesizes high-fidelity preference pairs from failed explorations. Empirical evaluations show that PACE outperforms DPO-R1 ($N=16$) while using only about $1/5$ of the compute, demonstrating superior robustness against reward hacking and label noise.
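The generation-based corrective strategy can be sketched as follows. This is a minimal, hypothetical reading of "synthesizes preference pairs from failed explorations": sample once, and on failure produce a corrected trajectory conditioned on the failed one rather than resampling many times. The callables `generate`, `correct`, and `verify` are stand-ins, not the paper's API; the abstract does not specify PACE's exact mechanism.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # corrected trajectory
    rejected: str  # failed exploration

def corrective_pair(
    prompt: str,
    generate: Callable[[str], str],        # hypothetical: draw one trajectory
    correct: Callable[[str, str], str],    # hypothetical: rewrite a failed one
    verify: Callable[[str, str], bool],    # hypothetical: outcome verifier
) -> Optional[PreferencePair]:
    """Sketch of a corrective strategy under a minimal sampling budget:
    one exploratory sample plus, on failure, one corrective sample
    (roughly two samples per prompt, consistent with a 2 < N < 3 budget)."""
    draft = generate(prompt)               # sample 1: exploration
    if verify(prompt, draft):
        return None                        # already correct; nothing to fix
    fix = correct(prompt, draft)           # sample 2: conditioned correction
    if verify(prompt, fix):
        # Pair the verified correction against the original failure.
        return PreferencePair(prompt, chosen=fix, rejected=draft)
    return None                            # correction also failed; discard
```

For example, with a failed draft "5" for the prompt "2+2?" and a corrector that returns "4", the function pairs "4" (chosen) against "5" (rejected); prompts the policy already solves contribute no pair.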