ETS: Energy-Guided Test-Time Scaling for Training-Free RL Alignment

πŸ“… 2026-01-29
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work proposes a training-free, inference-time alignment method for language models that circumvents the high cost and instability of conventional reinforcement learning (RL) approaches. By leveraging energy-based guidance, the method directly samples from the optimal RL policy using the transition probability structure of masked language models. Key innovations include the first demonstration of training-free RL alignment, the introduction of an online Monte Carlo estimator for the energy term, and the integration of importance sampling with modern inference acceleration frameworks to enhance efficiency without compromising sample quality. Experimental results show significant improvements in generation quality across reasoning, programming, and scientific tasks, demonstrating the method’s effectiveness and broad applicability.
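To make the "reference policy plus energy term" structure explicit, a generic formalization is sketched below. It assumes the standard KL-regularized RL objective with reward r, regularization strength β, and reference policy π_ref; the paper's per-token MLM transition may be parameterized differently, so this is the generic form rather than ETS's exact equation.

```latex
% Generic KL-regularized optimal policy with its energy term, and a Monte Carlo
% estimator of that term (assumed standard form, not necessarily the paper's
% exact per-token MLM transition).
\pi^{*}(y \mid x) \;\propto\; \pi_{\mathrm{ref}}(y \mid x)\,
    \exp\!\bigl(r(x, y)/\beta\bigr),
\qquad
E(x) \;=\; -\beta \log \mathbb{E}_{y \sim \pi_{\mathrm{ref}}(\cdot \mid x)}
    \bigl[\exp\!\bigl(r(x, y)/\beta\bigr)\bigr],
\qquad
\widehat{E}_{N}(x) \;=\; -\beta \log \frac{1}{N}\sum_{i=1}^{N}
    \exp\!\bigl(r(x, y_{i})/\beta\bigr),
\quad y_{i} \sim \pi_{\mathrm{ref}}(\cdot \mid x).
```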

πŸ“ Abstract
Reinforcement Learning (RL) post-training alignment for language models is effective but costly and unstable in practice, owing to its complicated training process. To address this, we propose a training-free inference method that samples directly from the optimal RL policy. Under Masked Language Modeling (MLM), the transition probability of this policy consists of a reference policy model and an energy term. Based on this, our algorithm, Energy-Guided Test-Time Scaling (ETS), estimates the key energy term via online Monte Carlo with a provable convergence rate. Moreover, to ensure practical efficiency, ETS leverages modern acceleration frameworks alongside tailored importance sampling estimators, substantially reducing inference latency while provably preserving sampling quality. Experiments on MLMs (including autoregressive models and diffusion language models) across reasoning, coding, and science benchmarks show that ETS consistently improves generation quality, validating its effectiveness and design.
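A minimal sketch of the sampling loop the abstract describes might look like the following. It assumes the standard soft-value form of the energy term and a simple propose-with-the-reference-model, resample-by-energy scheme; the helper names (`ref_propose`, `ref_rollout`, `reward_fn`), the candidate and rollout budgets, and the resampling rule are illustrative assumptions rather than the paper's exact ETS procedure or its acceleration-framework integration.

```python
import math
import random


def estimate_energy(partial_seq, ref_rollout, reward_fn, beta=1.0, num_rollouts=8):
    """Monte Carlo estimate of the energy term for a partial sequence.

    Assumes the standard soft-value form
        E(s) = -beta * log E_{y ~ pi_ref(. | s)}[ exp(r(y) / beta) ],
    estimated by completing the partial sequence with the reference model.
    `ref_rollout` and `reward_fn` are assumed interfaces, not the paper's API.
    """
    log_terms = [reward_fn(ref_rollout(partial_seq)) / beta for _ in range(num_rollouts)]
    m = max(log_terms)  # log-sum-exp shift for numerical stability
    log_mean = m + math.log(sum(math.exp(t - m) for t in log_terms) / num_rollouts)
    return -beta * log_mean


def energy_guided_step(partial_seq, ref_propose, ref_rollout, reward_fn,
                       beta=1.0, num_candidates=4):
    """One energy-guided transition: propose with pi_ref, resample by exp(-E / beta).

    Candidates are drawn from the reference policy and resampled with
    self-normalized importance weights exp(-E(candidate) / beta), so the accepted
    draw approximately follows pi_ref * exp(-E / beta).  The candidate count and
    rollout budget here are illustrative choices, not ETS's settings.
    """
    candidates = [ref_propose(partial_seq) for _ in range(num_candidates)]
    energies = [estimate_energy(c, ref_rollout, reward_fn, beta) for c in candidates]
    m = min(energies)  # shift before exponentiating to avoid overflow
    weights = [math.exp(-(e - m) / beta) for e in energies]
    total = sum(weights)
    probs = [w / total for w in weights]
    return random.choices(candidates, weights=probs, k=1)[0]
```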
Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning
Language Model Alignment
Training-Free Inference
Test-Time Scaling
Energy-Based Sampling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-Free Alignment
Energy-Guided Sampling
Test-Time Scaling
Monte Carlo Estimation
Importance Sampling
πŸ”Ž Similar Papers
No similar papers found.