🤖 AI Summary
This work addresses the fundamental trade-off in language model alignment between reducing the probability of undesired outputs and preserving average generation quality. To this end, the authors propose RePULSe, a reinforcement learning-based method that augments the standard RL loss with a term that uses learned proposal distributions to guide sampling toward low-reward (i.e., unsafe or undesirable) sequences and then explicitly reduces those sequences' generation probabilities. By jointly optimizing expected reward alongside this suppression term, RePULSe achieves a better trade-off between expected reward and the probability of undesired outputs, and exhibits greater adversarial robustness than standard RL alignment approaches and alternatives. Its core contribution lies in unifying low-reward output identification, guided sampling, and probability suppression within a single differentiable training objective, enabling more efficient and robust alignment.
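The objective described above can be sketched as a standard policy-gradient loss plus a suppression term on proposal-sampled low-reward sequences. This is a minimal illustrative sketch, not the paper's implementation: the function name, the sample representation, and the weighting coefficient `lam` are all assumptions.

```python
def repulse_loss(policy_samples, proposal_samples, lam=0.5):
    """Toy sketch of a RePULSe-style combined objective (assumed form).

    policy_samples: list of (log_prob, reward) pairs for sequences sampled
        from the policy, log_prob taken under the current policy.
    proposal_samples: list of the policy's log-probs for low-reward
        sequences surfaced by a learned proposal distribution.
    lam: assumed weight balancing the two terms.
    """
    # Standard RL term: REINFORCE-style estimator whose minimization
    # increases expected reward (hence the leading minus sign).
    rl_loss = -sum(lp * r for lp, r in policy_samples) / len(policy_samples)

    # Suppression term: minimizing the mean log-probability of the
    # proposal-found low-reward sequences pushes their probability down.
    suppress_loss = sum(proposal_samples) / len(proposal_samples)

    return rl_loss + lam * suppress_loss
```

In an actual training loop these quantities would be differentiable tensors (e.g. in PyTorch or JAX), so gradients of both terms flow back into the same policy parameters in a single update.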
📝 Abstract
Reinforcement learning (RL) has become a predominant technique for aligning language models (LMs) with human preferences or promoting outputs deemed desirable by a given reward function. Standard RL approaches optimize average reward, while methods explicitly focused on reducing the probability of undesired outputs typically come at a cost to average-case performance. To improve this trade-off, we introduce RePULSe, a new training method that augments the standard RL loss with an additional loss that uses learned proposals to guide sampling of low-reward outputs, and then reduces those outputs' probability. We run experiments demonstrating that, compared to standard RL alignment approaches and alternatives, RePULSe produces a better trade-off of expected reward versus the probability of undesired outputs and is more adversarially robust.