AdaptiveGuard: Towards Adaptive Runtime Safety for LLM-Powered Software

📅 2025-09-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the vulnerability of large language models (LLMs) to novel jailbreak attacks after deployment and the ineffectiveness of conventional static defenses, this paper proposes a continual-learning-based adaptive guardrail framework designed specifically for LLM jailbreak mitigation. The framework integrates out-of-distribution (OOD) input detection, lightweight online fine-tuning, and dynamic model updating to enable rapid identification of, and an evolving defense against, emerging attacks. It achieves 96% OOD detection accuracy, adapts to new attacks in only two update steps, and retains over 85% of the original task F1-score, substantially outperforming existing baselines. Its core contribution lies in applying continual learning to LLM runtime security, balancing defense timeliness, adaptability, and task-performance stability.
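The summary above does not pin down how the OOD detection step works; as a hedged illustration, the sketch below shows one common recipe for such a runtime check: energy-based OOD scoring over a guardrail classifier's logits. The `guard_model`, the 768-dimensional prompt embedding, and the threshold value are all hypothetical stand-ins, not the paper's actual components.

```python
import torch

# Hypothetical guardrail classifier: maps a prompt embedding to two
# logits (safe / unsafe). It stands in for the real AdaptiveGuard
# model, which is not specified at this level of detail.
guard_model = torch.nn.Linear(768, 2)

def energy_score(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    # Energy-based OOD score: in-distribution inputs tend to get low
    # energy; novel attack styles tend to get high energy.
    return -temperature * torch.logsumexp(logits / temperature, dim=-1)

def route_prompt(embedding: torch.Tensor, threshold: float = 0.0) -> str:
    # Route a prompt: flag OOD inputs for the adaptation queue,
    # otherwise apply the ordinary safe/unsafe decision.
    logits = guard_model(embedding)
    if energy_score(logits).item() > threshold:
        return "ood"  # candidate novel jailbreak -> queue for model update
    return "unsafe" if logits.argmax(dim=-1).item() == 1 else "safe"

# Usage with a dummy 768-dimensional prompt embedding; in practice the
# threshold would be calibrated on held-out in-distribution data.
print(route_prompt(torch.randn(768)))
```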

📝 Abstract
Guardrails are critical for the safe deployment of Large Language Model (LLM)-powered software. Unlike traditional rule-based systems with limited, predefined input-output spaces that inherently constrain unsafe behavior, LLMs enable open-ended, intelligent interactions, opening the door to jailbreak attacks through user inputs. Guardrails serve as a protective layer, filtering unsafe prompts before they reach the LLM. However, prior research shows that jailbreak attacks can still succeed over 70% of the time, even against advanced models like GPT-4o. While guardrails such as LlamaGuard report up to 95% accuracy, our preliminary analysis shows their performance can drop sharply, to as low as 12%, when confronted with unseen attacks. This highlights a growing software engineering challenge: how can we build a post-deployment guardrail that adapts dynamically to emerging threats? To address this, we propose AdaptiveGuard, an adaptive guardrail that detects novel jailbreak attacks as out-of-distribution (OOD) inputs and learns to defend against them through a continual learning framework. In our empirical evaluation, AdaptiveGuard achieves 96% OOD detection accuracy, adapts to new attacks in just two update steps, and retains over 85% F1-score on in-distribution data post-adaptation, outperforming other baselines. These results demonstrate that AdaptiveGuard is a guardrail capable of evolving in response to emerging jailbreak strategies after deployment. We release AdaptiveGuard and the studied datasets at https://github.com/awsm-research/AdaptiveGuard to support further research.
Problem

Research questions and friction points this paper is trying to address.

Building adaptive runtime safety guards for LLM-powered software against jailbreak attacks
Addressing performance drops in existing guardrails when facing unseen jailbreak strategies
Creating post-deployment guardrails that dynamically adapt to emerging security threats
Innovation

Methods, ideas, or system contributions that make the work stand out.

Detects novel jailbreak attacks as out-of-distribution inputs
Learns to defend against new attacks through a continual learning framework (a minimal sketch follows this list)
Adapts to new attacks in just two update steps
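As flagged in the list above, here is a minimal sketch of how such a continual-learning update could look, using experience replay (mixing newly flagged attacks with retained in-distribution examples) so the guardrail adapts without forgetting its original behavior. The names `guard_model` and `update_step`, the replay recipe, and all hyperparameters are illustrative assumptions; the paper's exact procedure is not reproduced here.

```python
import torch
import torch.nn.functional as F

# Hypothetical deployed guardrail classifier over 768-d prompt embeddings.
guard_model = torch.nn.Linear(768, 2)
optimizer = torch.optim.AdamW(guard_model.parameters(), lr=1e-4)

def update_step(flagged: list[tuple[torch.Tensor, int]],
                replay: list[tuple[torch.Tensor, int]]) -> float:
    # Mix newly flagged attacks with replayed in-distribution examples so
    # the update learns the new attack without degrading old decisions.
    batch = flagged + replay
    x = torch.stack([emb for emb, _ in batch])
    y = torch.tensor([label for _, label in batch])
    optimizer.zero_grad()
    loss = F.cross_entropy(guard_model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()

# Two lightweight updates, mirroring the paper's "two update steps" claim;
# the dummy data stands in for labeled OOD jailbreaks and a replay buffer.
flagged = [(torch.randn(768), 1) for _ in range(8)]  # new jailbreaks -> unsafe
replay = [(torch.randn(768), 0) for _ in range(8)]   # retained safe examples
for _ in range(2):
    print(update_step(flagged, replay))
```

Replay is one standard way to balance plasticity against forgetting; the reported 85%+ retained F1-score after adaptation is consistent with some such retention mechanism, though the paper's actual method may differ.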
Rui Yang
Monash University, Australia
Michael Fu
The University of Melbourne
Software Engineering · DevSecOps · Deep Learning · Language Models
Chakkrit Tantithamthavorn
Monash University, Australia
Chetan Arora
Monash University, Australia
Gunel Gulmammadova
Transurban, Australia
Joey Chua
Transurban, Australia