SafeCtrl-RL: Inference-Time Adaptive Behaviour Control for LLM Dialogue via RL-Driven Prompt Optimisation

📅 2026-05-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of keeping large language models both safe and contextually appropriate in real-world dialogue. The authors propose SafeCtrl-RL, an adaptive safety control method applied at inference time that formulates response generation as a sequential decision-making process. A reinforcement learning agent dynamically selects prompt adjustment strategies based on contextual feedback, and the resulting iterative prompt refinement, which the authors describe as inference-time behavioural unlearning, suppresses unsafe behaviours. Across multiple large language models and unsafe dialogue scenarios, the approach is reported to improve both the safety and quality of responses, outperform existing prompt optimisation techniques, and maintain a favourable trade-off between performance and computational efficiency.

📝 Abstract

Ensuring safe and contextually appropriate behaviour in Large Language Models (LLMs) remains a critical challenge for real-world deployment. We present SafeCtrl-RL, an inference-time behavioural control framework that enables adaptive safety regulation without model retraining or parameter modification. The method formulates dialogue generation as a sequential decision process, where a reinforcement learning agent dynamically selects prompt adjustment strategies based on contextual feedback. This allows unsafe behaviours to be suppressed through iterative refinement, which we conceptualise as inference-time behavioural unlearning. Evaluated across multiple LLMs and unsafe dialogue scenarios, SafeCtrl-RL consistently improves safety and response quality, outperforms existing prompt-based optimisation methods, and achieves favourable performance-efficiency trade-offs. Warning: This paper may contain examples of harmful language, and reader discretion is recommended.
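
To make the idea of an agent choosing prompt adjustment strategies from feedback more concrete, here is a minimal Python sketch. It is not the paper's algorithm: the abstract does not specify the state, action, or reward design, so the sketch swaps in a simple epsilon-greedy value learner with optimistic initial values (closer to a contextual bandit than a full sequential decision process). The strategy names, `toy_model`, `toy_reward`, and `StrategySelector` are all hypothetical stand-ins, and no model weights are touched at any point.

```python
import random

# Hypothetical prompt-adjustment strategies (assumed, not taken from the paper).
STRATEGIES = ["no_change", "add_safety_reminder", "refuse_with_alternative"]


def toy_model(user_msg, strategy):
    """Stand-in for an LLM call: returns a canned reply for each strategy."""
    if strategy == "no_change":
        return "UNSAFE: detailed answer"
    if strategy == "add_safety_reminder":
        return "Partial answer with caveats"
    return "I can't help with that, but here is a safe alternative."


def toy_reward(reply):
    """Stand-in for contextual feedback: a combined safety/quality score in [0, 1]."""
    if reply.startswith("UNSAFE"):
        return 0.0
    if "caveats" in reply:
        return 0.5
    return 1.0


class StrategySelector:
    """Epsilon-greedy value estimates over strategies, keyed by a coarse context label."""

    def __init__(self, epsilon=0.1, lr=0.5, seed=0):
        self.q = {}  # context -> list of value estimates, one per strategy
        self.epsilon = epsilon
        self.lr = lr
        self.rng = random.Random(seed)

    def values(self, context):
        # Optimistic initial values (1.0) so every strategy gets tried at least once.
        return self.q.setdefault(context, [1.0] * len(STRATEGIES))

    def select(self, context, greedy=False):
        vals = self.values(context)
        if not greedy and self.rng.random() < self.epsilon:
            return self.rng.randrange(len(STRATEGIES))  # explore
        return max(range(len(STRATEGIES)), key=vals.__getitem__)  # exploit

    def update(self, context, action, reward):
        # Move the estimate for the chosen strategy toward the observed reward.
        vals = self.values(context)
        vals[action] += self.lr * (reward - vals[action])


def run(turns=200, seed=0):
    """Repeatedly pick a strategy, query the stub model, and learn from the score."""
    agent = StrategySelector(seed=seed)
    context = "risky_request"  # a real system would derive this from the dialogue
    for _ in range(turns):
        action = agent.select(context)
        reply = toy_model("placeholder user message", STRATEGIES[action])
        agent.update(context, action, toy_reward(reply))
    return agent


if __name__ == "__main__":
    agent = run()
    best = agent.select("risky_request", greedy=True)
    print(STRATEGIES[best], agent.values("risky_request"))
```

In this toy setting the selector settles on the strategy whose replies score highest, which mirrors the abstract's claim that control happens through prompt choice at inference time. A faithful implementation would need the paper's actual strategy set, feedback signal, and multi-turn state.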

Problem

Research questions and friction points this paper is trying to address.

Large Language Models

Safety Control

Inference-Time Adaptation

Dialogue Safety

Behavioural Unlearning

Innovation

Methods, ideas, or system contributions that make the work stand out.

inference-time control

reinforcement learning

prompt optimisation