A Real-Time, Self-Tuning Moderator Framework for Adversarial Prompt Detection

📅 2025-08-09
🤖 AI Summary
Defenses against adversarial prompting and jailbreaking attacks on large language models (LLMs) often lag behind new attacks, degrade benign response quality, and incur high deployment overhead. To address these issues, this paper proposes a real-time, self-tuning, lightweight auditing framework. Departing from conventional fine-tuning or static classifiers, our method introduces a dynamic, adjustable auditing module that couples adversarial-sample detection with feedback-driven online parameter optimization, enabling millisecond-level adaptive updates of the defense policy. Experiments on Google's Gemini models demonstrate robust resistance to state-of-the-art jailbreaking techniques—including Ghost, Multistep, and Suffix—achieving a false positive rate below 2.1%, an average latency increase of less than 8 ms on benign prompts, and negligible degradation in BLEU and FactScore (both <0.5%). Our core contribution is the first realization of real-time safety alignment that is low-overhead, high-fidelity, and continuously evolvable via online adaptation.
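The summary describes a moderator that scores incoming prompts and tunes its own defense policy online from feedback. The paper does not publish its implementation, so the following is only a minimal sketch of that loop: the class name `RTSTModerator`, the keyword-based placeholder detector, and the threshold-nudging update rule are all illustrative assumptions, not the authors' actual design.

```python
class RTSTModerator:
    """Illustrative sketch of a real-time, self-tuning moderator:
    score each prompt, block those above a threshold, and adjust the
    threshold online from feedback to track a target false-positive rate.
    The detector and update rule here are placeholders, not the paper's."""

    def __init__(self, threshold=0.5, target_fpr=0.021, lr=0.05):
        self.threshold = threshold    # block prompts scoring >= this
        self.target_fpr = target_fpr  # e.g. the paper's <2.1% FPR target
        self.lr = lr                  # step size for online updates

    def score(self, prompt: str) -> float:
        # Placeholder detector: a real system would use a lightweight
        # learned classifier over prompt features, not keyword matching.
        suspicious = ("ignore previous", "jailbreak", "pretend you have no rules")
        return 0.9 if any(s in prompt.lower() for s in suspicious) else 0.1

    def moderate(self, prompt: str) -> bool:
        """Return True if the prompt should be blocked."""
        return self.score(prompt) >= self.threshold

    def feedback(self, prompt: str, was_benign: bool) -> None:
        """Online update: a blocked benign prompt (false positive) raises
        the threshold; a missed attack lowers it."""
        blocked = self.moderate(prompt)
        if was_benign and blocked:            # false positive: loosen
            self.threshold = min(1.0, self.threshold + self.lr)
        elif not was_benign and not blocked:  # missed attack: tighten
            self.threshold = max(0.0, self.threshold - self.lr)
```

Because the update touches only a scalar policy parameter rather than model weights, each feedback step is constant-time, which is one plausible way to achieve the millisecond-level adaptation the summary claims.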

📝 Abstract
Ensuring LLM alignment is critical to information security as AI models become increasingly widespread and integrated into society. Unfortunately, many defenses against adversarial attacks and jailbreaking on LLMs cannot adapt quickly to new attacks, degrade model responses to benign prompts, or introduce significant barriers to scalable implementation. To mitigate these challenges, we introduce a real-time, self-tuning (RTST) moderator framework that defends against adversarial attacks while maintaining a lightweight training footprint. We empirically evaluate its effectiveness using Google's Gemini models against modern, effective jailbreaks. Our results demonstrate the advantages of an adaptive, minimally intrusive framework for jailbreak defense over traditional fine-tuning or classifier models.
Problem

Research questions and friction points this paper is trying to address.

Detecting adversarial prompts for LLMs in real time
Adapting quickly to new attacks without degrading benign-prompt performance
Keeping the training footprint lightweight for scalable deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Real-time, self-tuning (RTST) moderator framework
Defense with a lightweight training footprint
Adaptive, minimally intrusive design compared to fine-tuning or static classifiers