Almost Surely Safe Alignment of Large Language Models at Inference-Time

📅 2025-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing inference-time safety alignment methods for large language models (LLMs) rely heavily on costly RLHF, generalize poorly, and lack formal safety guarantees. Method: This paper proposes InferenceGuard, a fine-tuning-free inference-time safety alignment approach that leaves the model weights frozen. Its core innovation is the first formulation of a constrained Markov decision process (CMDP) in the LLM's latent space, equipped with formal safety proofs: safe response generation is modeled as a sequential decision problem under safety-state constraints, incorporating a trackable safety state and safety-aware trajectory planning. Contribution/Results: The framework ensures “almost-sure safety”—i.e., the probability of generating a safe response approaches one. Extensive evaluation across multiple benchmarks shows that InferenceGuard significantly outperforms existing inference-time alignment methods in safety rate while preserving task performance with no statistically significant degradation.

📝 Abstract
Even highly capable large language models (LLMs) can produce biased or unsafe responses, and alignment techniques such as RLHF, aimed at mitigating this issue, are expensive and prone to overfitting because they retrain the LLM. This paper introduces a novel inference-time alignment approach that ensures LLMs generate safe responses almost surely, i.e., with probability approaching one. We achieve this by framing the safe generation of inference-time responses as a constrained Markov decision process (CMDP) within the LLM's latent space. Crucially, we augment the latent state with a safety state that tracks the evolution of safety constraints, which enables us to establish formal safety guarantees upon solving the CMDP in the latent space. Building on this foundation, we propose InferenceGuard, a practical implementation that safely aligns LLMs without modifying the model weights. Empirically, we demonstrate that InferenceGuard effectively balances safety and task performance, outperforming existing inference-time alignment methods in generating safe and aligned responses.
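To make the safety-state idea concrete, here is a minimal sketch of how a decoding state could be augmented with a safety budget that tracks constraint evolution, in the spirit of the CMDP formulation described above. All names (`AugmentedState`, `step`, the per-token costs) are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch: augmenting a generation state with a safety state (budget).
# A trajectory satisfies the constraint iff the budget never goes negative.
from dataclasses import dataclass


@dataclass(frozen=True)
class AugmentedState:
    tokens: tuple   # generated token ids (stand-in for the latent state)
    budget: float   # remaining safety budget, updated at every step


def step(state: AugmentedState, token: int, token_cost: float) -> AugmentedState:
    """One CMDP transition: append a token and deduct its safety cost."""
    return AugmentedState(state.tokens + (token,), state.budget - token_cost)


def is_safe(state: AugmentedState) -> bool:
    """The augmented state makes the constraint checkable at any step."""
    return state.budget >= 0.0


# Tiny usage example with hypothetical per-token safety costs
s = AugmentedState(tokens=(), budget=1.0)
for tok, cost in [(11, 0.2), (42, 0.3)]:
    s = step(s, tok, cost)
print(is_safe(s))  # True: 1.0 - 0.2 - 0.3 = 0.5 >= 0
```

Because the budget is part of the state, a planner can prune or re-rank candidate continuations whose remaining budget would be exhausted, which is how an inference-time method can steer decoding toward safe trajectories without touching the model weights.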
Problem

Research questions and friction points this paper is trying to address.

Language Model Safety
Response Bias
RLHF Alternatives
Innovation

Methods, ideas, or system contributions that make the work stand out.

InferenceGuard
answer safety
language models
Xiaotong Ji
Huawei Noah’s Ark Lab, Imperial College London
Shyam Sundhar Ramesh
Huawei Noah’s Ark Lab, University College London
Matthieu Zimmer
RL Research Scientist @ Huawei Noah’s Ark Lab
artificial intelligence: learning, developmental learning, reinforcement learning, neural networks
Ilija Bogunovic
Assistant professor, UCL
Machine Learning
Jun Wang
University College London
H. Ammar
Huawei Noah’s Ark Lab