🤖 AI Summary
This work addresses an inconsistency in static weighted supervised fine-tuning (SFT) when it is used in place of online reinforcement learning with verifiable rewards (RLVR): a weighted likelihood is not determined by rewards alone, so an arbitrary choice of sampler and weights need not recover the Boltzmann target policy of the KL-regularized objective. To resolve this, the authors propose BOLT (Boltzmann-Targeted SFT), which samples from the reference policy and applies prompt-normalized Boltzmann weights, so that the policy induced by the SFT objective equals the fixed-reference KL-regularized optimizer. The key contributions are identifying the weighting that, under reference sampling, matches the Boltzmann policy uniquely up to prompt scaling; decomposing the error of the finite one-shot estimator; and connecting refreshed Boltzmann projections to KL policy mirror descent. Single-run Qwen experiments provide evidence for the target-matched weight, one-shot saturation, and the gains and optimization-time savings from refreshed sampling.
📝 Abstract
Online reinforcement learning with verifiable rewards (RLVR) turns checkable outcomes into a scalable training signal, but it keeps rollout generation, verifier scoring, and reference-policy evaluations on the optimization path. Static weighted supervised fine-tuning (SFT) on precomputed rollouts seems to remove this bottleneck, yet a weighted likelihood is not specified by rewards alone: its sampler and weights induce the policy being fit. This paper identifies the reference-sampled weighted-SFT objective whose induced policy equals the fixed-reference KL-regularized RLVR optimizer. The optimizer is the standard Boltzmann target policy, obtained by exponentially tilting the reference policy by verifier reward. Matching a weighted-SFT induced policy to this target forces density-ratio weights; in the reference-sampled subclass, this reduces uniquely, up to prompt scaling, to the prompt-normalized Boltzmann weight $\exp(r(x,y)/\beta)/Z(x)$. BOLT, a Boltzmann-Targeted SFT procedure, is the empirical estimator of this projection. The finite one-shot analysis separates the exact stored-support price $\beta\log(1/\pi^*(S_N\mid x))$ from partition estimation, effective-sample-size variance, generalization, optimization, and approximation errors. This decomposition explains why extra SFT epochs cannot repair missing reference-policy coverage and exposes the temperature–coverage–variance frontier. When coverage needs adaptive sampling, refreshed Boltzmann projections become KL policy mirror descent; finite inner solves enter as additive drift from the exact mirror step. Single-run Qwen experiments provide projection evidence for the target-matched weight, one-shot saturation, refreshed-sampler gains, and optimization-time savings, within the stated single-run scope.
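To make the prompt-normalized Boltzmann weight concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the rollouts for one prompt were drawn from the reference policy and that the partition function Z(x) is estimated by the sample mean of exp(r/β) over those rollouts; the abstract does not specify the estimator, so this choice is an assumption. The function names `boltzmann_weights`, `effective_sample_size`, and `weighted_nll` are hypothetical.

```python
import math


def boltzmann_weights(rewards, beta):
    """Weights exp(r_i / beta) / Z_hat for the rollouts of a single prompt.

    Assumption: Z_hat is the sample mean of exp(r_i / beta) over rollouts
    drawn from the reference policy, so the weights average to 1 per prompt.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    if not rewards:
        raise ValueError("need at least one rollout")
    scaled = [r / beta for r in rewards]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    z_hat = sum(exps) / len(exps)  # the common factor exp(shift) cancels
    return [e / z_hat for e in exps]


def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum w^2."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return total * total / total_sq


def weighted_nll(weights, logprobs):
    """Weighted SFT loss for one prompt: -(1/N) * sum_i w_i * log pi(y_i | x)."""
    return -sum(w * lp for w, lp in zip(weights, logprobs)) / len(weights)


if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.0, 1.0]  # illustrative binary verifier rewards
    for beta in (1.0, 0.1):
        w = boltzmann_weights(rewards, beta)
        print(beta, w, effective_sample_size(w))
```

In this sketch, lowering β concentrates the weight on high-reward rollouts and shrinks the effective sample size, which illustrates the temperature and variance sides of the trade-off the abstract describes. A prompt with no successful rollouts receives uniform weights, so reweighting alone cannot supply coverage that the stored samples lack.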