🤖 AI Summary
This work addresses the fundamental trade-off in language model alignment between reducing the probability of undesired outputs and preserving average generation quality. To this end, the authors propose RePULSe, a reinforcement learning-based method that augments the standard RL loss with a term that uses learned proposal distributions to guide sampling toward low-reward (i.e., unsafe or undesirable) sequences and then explicitly reduces those sequences' generation probabilities. By jointly optimizing expected reward alongside this suppression term, RePULSe achieves a better trade-off between expected reward and the probability of undesired outputs, and exhibits greater adversarial robustness than standard RL alignment approaches and alternatives. Its core contribution lies in unifying low-reward output identification, guided sampling, and probability suppression within a single differentiable training objective, enabling more efficient and robust alignment.
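The objective described above can be sketched as a standard policy-gradient loss plus a suppression term on proposal-sampled low-reward sequences. This is a minimal illustrative sketch, not the paper's implementation: the function name, the sample representation, and the weighting coefficient `lam` are all assumptions.

```python
def repulse_loss(policy_samples, proposal_samples, lam=0.5):
    """Toy sketch of a RePULSe-style combined objective (assumed form).

    policy_samples: list of (log_prob, reward) pairs for sequences sampled
        from the policy, log_prob taken under the current policy.
    proposal_samples: list of the policy's log-probs for low-reward
        sequences surfaced by a learned proposal distribution.
    lam: assumed weight balancing the two terms.
    """
    # Standard RL term: REINFORCE-style estimator whose minimization
    # increases expected reward (hence the leading minus sign).
    rl_loss = -sum(lp * r for lp, r in policy_samples) / len(policy_samples)

    # Suppression term: minimizing the mean log-probability of the
    # proposal-found low-reward sequences pushes their probability down.
    suppress_loss = sum(proposal_samples) / len(proposal_samples)

    return rl_loss + lam * suppress_loss
```

In an actual training loop these quantities would be differentiable tensors (e.g. in PyTorch or JAX), so gradients of both terms flow back into the same policy parameters in a single update.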
📝 Abstract
Reinforcement learning (RL) has become a predominant technique for aligning language models (LMs) with human preferences or promoting outputs deemed desirable by a given reward function. Standard RL approaches optimize average reward, while methods explicitly focused on reducing the probability of undesired outputs typically come at a cost to average-case performance. To improve this trade-off, we introduce RePULSe, a new training method that augments the standard RL loss with an additional loss that uses learned proposals to guide sampling of low-reward outputs, and then reduces those outputs' probability. We run experiments demonstrating that, compared to standard RL alignment approaches and alternatives, RePULSe produces a better trade-off of expected reward versus the probability of undesired outputs and is more adversarially robust.