Reducing the Probability of Undesirable Outputs in Language Models Using Probabilistic Inference

📅 2025-10-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the fundamental trade-off in language model alignment between reducing the probability of harmful outputs and preserving average generation quality. To this end, the authors propose RePULSe, a reinforcement learning–based method that uses a learned proposal model and probabilistic inference to guide sampling toward low-reward (i.e., unsafe or undesirable) sequences and then explicitly suppress their generation probability under the model. By learning a proposal distribution that targets such sequences and applying a differentiable probability-suppression loss alongside the standard RL objective, RePULSe jointly optimizes expected reward and safety. Empirical evaluation across multiple benchmarks shows that RePULSe achieves a better tradeoff between expected reward and the rate of undesired outputs, and is more adversarially robust than standard RL alignment approaches and alternatives. Its core contribution is unifying low-reward output discovery, guided sampling, and probability suppression within a single differentiable training objective.

📝 Abstract
Reinforcement learning (RL) has become a predominant technique to align language models (LMs) with human preferences or promote outputs which are deemed to be desirable by a given reward function. Standard RL approaches optimize average reward, while methods explicitly focused on reducing the probability of undesired outputs typically come at a cost to average-case performance. To improve this tradeoff, we introduce RePULSe, a new training method that augments the standard RL loss with an additional loss that uses learned proposals to guide sampling low-reward outputs, and then reduces those outputs' probability. We run experiments demonstrating that RePULSe produces a better tradeoff of expected reward versus the probability of undesired outputs and is more adversarially robust, compared to standard RL alignment approaches and alternatives.
Problem

Research questions and friction points this paper is trying to address.

Reducing undesirable outputs in language models
Improving tradeoff between reward and safety
Enhancing adversarial robustness in model alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses probabilistic inference to guide sampling
Augments RL loss with additional loss component
Reduces probability of low-reward outputs directly
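The augmented objective described above can be sketched in a few lines. The following is a minimal, illustrative toy (not the authors' implementation): a standard expected-reward term plus a penalty that pushes down the model's log-probability of low-reward samples drawn from a learned proposal, reweighted by importance weights. The function name, the `threshold` cutoff for "undesired" outputs, and the scalar inputs are all assumptions for illustration.

```python
import math

def augmented_loss(log_p, log_q, rewards, lam=0.5, threshold=0.0):
    """Toy RePULSe-style objective (illustrative sketch, not the paper's code).

    log_p:   model log-probabilities of sampled sequences
    log_q:   proposal log-probabilities of the same sequences
    rewards: scalar reward per sequence
    lam:     weight on the suppression term (hypothetical hyperparameter)
    """
    # Standard RL term: maximize expected reward, i.e. minimize its negative.
    rl_term = -sum(rewards) / len(rewards)

    # Suppression term: for low-reward samples from the proposal q,
    # penalize the model's log-probability, importance-weighted by p(x)/q(x).
    penalties = []
    for lp, lq, r in zip(log_p, log_q, rewards):
        if r < threshold:
            w = math.exp(lp - lq)  # importance weight p(x)/q(x)
            penalties.append(w * lp)
    sup_term = sum(penalties) / len(penalties) if penalties else 0.0

    # Minimizing this loss raises expected reward and lowers the
    # probability mass the model assigns to low-reward outputs.
    return rl_term + lam * sup_term
```

Minimizing `lam * sup_term` decreases `log_p` for outputs below the reward threshold, which is the "reduces probability of low-reward outputs directly" idea; in the paper this is done over sequences sampled from the learned proposal rather than toy scalars.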
Stephen Zhao
University of Toronto and Vector Institute
Aidan Li
Université de Montréal and Mila
Rob Brekelmans
Vector Institute
Roger Grosse
Associate Professor, University of Toronto
Machine learning