🤖 AI Summary
Defenses against adversarial prompting and jailbreaking attacks on large language models (LLMs) tend to lag behind new attacks, degrade benign response quality, and incur high deployment overhead. To address these problems, this paper proposes a real-time, self-tuning, lightweight auditing framework. Departing from conventional fine-tuning or static classifiers, the method introduces a dynamic, adjustable auditing module that couples adversarial-sample detection with feedback-driven online parameter optimization, enabling millisecond-scale adaptive updates to the defense policy. Experiments on the Gemini models demonstrate robust resistance to state-of-the-art jailbreaking techniques, including Ghost, Multistep, and Suffix, with a false positive rate below 2.1%, an average latency increase of under 8 ms on benign prompts, and negligible degradation in BLEU and FactScore (both <0.5%). The core contribution is the first realization of real-time safety alignment that is low-overhead, high-fidelity, and continuously evolvable through online adaptation.
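The feedback loop described above, an auditing module whose decision parameters are nudged online by detection outcomes, can be sketched in a few lines. This is a minimal illustrative toy, not the paper's actual method: the keyword-based score, the single threshold parameter, and the update rule are all assumptions made for the sake of the example.

```python
# Toy sketch of a feedback-driven, self-tuning moderator (assumptions:
# the scoring function, threshold parameter, and learning rate are
# illustrative, not taken from the paper).

SUSPICIOUS_MARKERS = ("ignore previous", "pretend you are", "system override")

def audit_score(prompt: str) -> float:
    """Toy adversarial-likelihood score: fraction of markers present."""
    p = prompt.lower()
    hits = sum(marker in p for marker in SUSPICIOUS_MARKERS)
    return hits / len(SUSPICIOUS_MARKERS)

class SelfTuningModerator:
    def __init__(self, threshold: float = 0.3, lr: float = 0.05):
        self.threshold = threshold  # decision boundary, tuned online
        self.lr = lr                # feedback step size

    def flag(self, prompt: str) -> bool:
        """Audit a prompt against the current threshold."""
        return audit_score(prompt) >= self.threshold

    def feedback(self, prompt: str, was_jailbreak: bool) -> None:
        """Online update: lower the threshold after a missed jailbreak,
        raise it after a false positive on a benign prompt."""
        flagged = self.flag(prompt)
        if was_jailbreak and not flagged:
            self.threshold = max(0.0, self.threshold - self.lr)
        elif not was_jailbreak and flagged:
            self.threshold = min(1.0, self.threshold + self.lr)
```

Because each update touches only a scalar parameter rather than model weights, adaptation is cheap enough to run per request, which is the intuition behind the millisecond-scale policy updates claimed in the summary.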
📝 Abstract
Ensuring LLM alignment is critical to information security as AI models become increasingly widespread and integrated into society. Unfortunately, many defenses against adversarial attacks and jailbreaking on LLMs cannot adapt quickly to new attacks, degrade model responses to benign prompts, or introduce significant barriers to scalable deployment. To mitigate these challenges, we introduce a real-time, self-tuning (RTST) moderator framework that defends against adversarial attacks while maintaining a lightweight training footprint. We empirically evaluate its effectiveness on Google's Gemini models against modern, effective jailbreaks. Our results demonstrate the advantages of an adaptive, minimally intrusive framework for jailbreak defense over traditional fine-tuning or classifier-based approaches.