🤖 AI Summary
This work addresses the challenge of improving the reasoning of large language models without fine-tuning, while avoiding the high latency of existing sampling methods. The authors propose Power-SMC, a training-free sequential Monte Carlo (SMC) approach that, for the first time, applies SMC to sequence-level power-distribution sampling. By combining an optimally tuned prefix proposal temperature, a stability analysis grounded in Rényi entropy, and an exponential exponent-bridging annealing schedule, Power-SMC achieves substantial reasoning gains, matching or surpassing Metropolis–Hastings sampling on the MATH500 benchmark, while keeping computational overhead low: decoding latency is only 1.4 to 3.3 times that of baseline decoding.
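The summary mentions an exponential bridge annealing schedule on the exponent. As a rough illustration only, one plausible shape for such a schedule is an exponent that starts at $1$ (ordinary decoding) and rises exponentially fast toward the target $\alpha$ as decoding proceeds; the function name `bridged_alpha`, the `rate` parameter, and this exact functional form are assumptions for illustration, not the paper's schedule:

```python
import math

def bridged_alpha(t, alpha=4.0, rate=0.1):
    """Hypothetical exponential bridge on the exponent: starts at 1.0 at
    step t=0 and approaches the target `alpha` as t grows, so early tokens
    are sampled near the base model and later tokens near the sharpened
    target. The form 1 + (alpha-1)(1 - e^{-rate*t}) is an assumption."""
    return 1.0 + (alpha - 1.0) * (1.0 - math.exp(-rate * t))
```

Because the sequence-level target is only reached in the limit, a schedule like this trades a gentler early weight distribution (better particle stability) for a gradual approach to the full exponent.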
📝 Abstract
Many recent reasoning gains in large language models can be explained as distribution sharpening: biasing generation toward high-likelihood trajectories already supported by the pretrained model, rather than modifying its weights. A natural formalization is the sequence-level power distribution $\pi_\alpha(y\mid x)\propto p_\theta(y\mid x)^\alpha$ ($\alpha>1$), which concentrates mass on whole sequences instead of adjusting token-level temperature. Prior work shows that Metropolis--Hastings (MH) sampling from this distribution recovers strong reasoning performance, but at order-of-magnitude inference slowdowns. We introduce Power-SMC, a training-free Sequential Monte Carlo scheme that targets the same objective while remaining close to standard decoding latency. Power-SMC advances a small particle set in parallel, corrects importance weights token-by-token, and resamples when necessary, all within a single GPU-friendly batched decode. We prove that temperature $\tau=1/\alpha$ is the unique prefix-only proposal minimizing incremental weight variance, interpret residual instability via prefix-conditioned R\'enyi entropies, and introduce an exponent-bridging schedule that improves particle stability without altering the target. On MATH500, Power-SMC matches or exceeds MH power sampling while reducing latency from $16$--$28\times$ to $1.4$--$3.3\times$ over baseline decoding.
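The abstract's recipe, advance particles in parallel under a temperature-$\tau=1/\alpha$ proposal, accumulate incremental importance weights token-by-token, and resample when the particle set degenerates, can be sketched in a toy setting. The helper `next_token_probs` below is a hypothetical stand-in for the model's next-token distribution (not the paper's model), and the ESS-based resampling trigger is a standard SMC choice assumed here; note how, with the $\tau=1/\alpha$ proposal, the incremental weight reduces to the prefix-only normalizer $Z_t=\sum_v p(v\mid \text{prefix})^\alpha$, independent of the sampled token, which is the variance-minimizing property the paper proves:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, ALPHA, N = 5, 2.0, 8  # toy vocabulary size, target exponent, particles

def next_token_probs(prefix):
    """Toy stand-in for p_theta(. | x, prefix): a deterministic categorical
    derived from the prefix contents (hypothetical, for illustration only)."""
    r = np.random.default_rng(hash(tuple(prefix)) % (2**32))
    logits = r.normal(size=VOCAB)
    p = np.exp(logits - logits.max())
    return p / p.sum()

def power_smc(steps=6, ess_frac=0.5):
    """Sketch of SMC targeting pi_alpha(y) ∝ p(y)^alpha at the sequence level."""
    particles = [[] for _ in range(N)]
    logw = np.zeros(N)                      # log importance weights
    for _ in range(steps):
        for i in range(N):
            p = next_token_probs(particles[i])
            tempered = p ** ALPHA           # proposal ∝ p^alpha, i.e. tau = 1/alpha
            Z = tempered.sum()
            tok = rng.choice(VOCAB, p=tempered / Z)
            particles[i].append(int(tok))
            # incremental weight p(tok)^alpha / q(tok) = Z: it depends only on
            # the prefix, not on the sampled token (minimum-variance proposal)
            logw[i] += np.log(Z)
        # resample when the effective sample size drops below a threshold
        w = np.exp(logw - logw.max()); w /= w.sum()
        if 1.0 / np.sum(w ** 2) < ess_frac * N:
            idx = rng.choice(N, size=N, p=w)
            particles = [list(particles[j]) for j in idx]
            logw[:] = 0.0
    return particles, logw
```

In a real deployment the inner loop over particles would be a single batched forward pass, which is what keeps the latency overhead close to ordinary decoding.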