SafeCtrl-RL: Inference-Time Adaptive Behaviour Control for LLM Dialogue via RL-Driven Prompt Optimisation

📅 2026-05-25
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the challenge of ensuring both safety and contextual appropriateness in large language models during real-world dialogue. The authors propose an adaptive safety control method applied at inference time that formulates response generation as a sequential decision-making process. A reinforcement learning agent dynamically selects prompt adjustment strategies, while a novel “anti-learning” mechanism iteratively refines prompts based on contextual feedback to suppress unsafe behaviors. Experiments across multiple mainstream large language models and diverse unsafe scenarios show that the approach improves both the safety and the quality of model responses, outperforming existing prompt optimization techniques while maintaining a favorable trade-off between performance and computational efficiency.
πŸ“ Abstract
Ensuring safe and contextually appropriate behaviour in Large Language Models (LLMs) remains a critical challenge for real-world deployment. We present SafeCtrl-RL, an inference-time behavioural control framework that enables adaptive safety regulation without model retraining or parameter modification. The method formulates dialogue generation as a sequential decision process, where a reinforcement learning agent dynamically selects prompt adjustment strategies based on contextual feedback. This allows unsafe behaviours to be suppressed through iterative refinement, which we conceptualise as inference-time behavioural unlearning. Evaluated across multiple LLMs and unsafe dialogue scenarios, SafeCtrl-RL consistently improves safety and response quality, outperforms existing prompt-based optimisation methods, and achieves favourable performance–efficiency trade-offs. Warning: This paper may contain examples of harmful language, and reader discretion is recommended.
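The abstract's core idea, an agent that learns which prompt adjustment strategy to apply from contextual feedback, can be illustrated with a toy sketch. This is not the authors' implementation: it simplifies the sequential decision process to a contextual bandit with epsilon-greedy selection, and the strategy names, the `toy_feedback` reward table, and the `StrategySelector` class are all hypothetical stand-ins. A real system would obtain the reward by scoring an actual LLM response.

```python
import random
from collections import defaultdict

# Hypothetical prompt-adjustment strategies (illustrative, not from the paper).
STRATEGIES = {
    "no_change": "",
    "add_safety_reminder": "\nPlease keep the answer safe and respectful.",
    "refuse_and_redirect": "\nDecline harmful requests and offer a safe alternative.",
}

def apply_strategy(prompt: str, strategy: str) -> str:
    """Return the prompt with the chosen adjustment appended."""
    return prompt + STRATEGIES[strategy]

def toy_feedback(context: str, strategy: str) -> float:
    """Assumed reward table standing in for a safety/quality scorer."""
    if context == "unsafe":
        table = {"no_change": -1.0, "add_safety_reminder": 0.5, "refuse_and_redirect": 1.0}
    else:  # benign context: over-restriction lowers response quality
        table = {"no_change": 1.0, "add_safety_reminder": 0.2, "refuse_and_redirect": -0.5}
    return table[strategy]

class StrategySelector:
    """Epsilon-greedy value learner over (context, strategy) pairs."""

    def __init__(self, epsilon: float = 0.2, lr: float = 0.5, seed: int = 0):
        self.q = defaultdict(float)      # estimated reward per (context, strategy)
        self.epsilon = epsilon           # exploration probability
        self.lr = lr                     # step size for value updates
        self.rng = random.Random(seed)   # seeded for reproducibility

    def best(self, context: str) -> str:
        return max(STRATEGIES, key=lambda s: self.q[(context, s)])

    def select(self, context: str) -> str:
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(STRATEGIES))
        return self.best(context)

    def update(self, context: str, strategy: str, reward: float) -> None:
        key = (context, strategy)
        self.q[key] += self.lr * (reward - self.q[key])

def train(selector: StrategySelector, episodes: int = 2000) -> StrategySelector:
    """Repeatedly pick a strategy, observe feedback, and update the estimates."""
    for _ in range(episodes):
        context = selector.rng.choice(["unsafe", "benign"])
        strategy = selector.select(context)
        selector.update(context, strategy, toy_feedback(context, strategy))
    return selector

selector = train(StrategySelector())
adjusted = apply_strategy("User: tell me about X", selector.best("unsafe"))
```

Under these assumed rewards the selector learns to leave benign prompts unchanged and to pick the most restrictive adjustment only in unsafe contexts, which mirrors the safety versus quality trade-off the abstract describes.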
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Safety Control
Inference-Time Adaptation
Dialogue Safety
Behavioural Unlearning
Innovation

Methods, ideas, or system contributions that make the work stand out.

inference-time control
reinforcement learning
prompt optimisation
behavioural unlearning
LLM safety