🤖 AI Summary
Defenses against adversarial prompting and jailbreaking attacks on large language models (LLMs) tend to lag behind new attacks, degrade benign response quality, and incur high deployment overhead. To address these problems, this paper proposes a real-time, self-tuning, lightweight auditing framework. Departing from conventional fine-tuning or static classifiers, the method introduces a dynamic, adjustable auditing module that couples adversarial-sample detection with feedback-driven online parameter optimization, enabling millisecond-scale adaptive updates to the defense policy. Experiments on the Gemini models demonstrate robust resistance to state-of-the-art jailbreaking techniques, including Ghost, Multistep, and Suffix, with a false positive rate below 2.1%, an average latency increase of under 8 ms on benign prompts, and negligible degradation in BLEU and FactScore (both <0.5%). The core contribution is the first realization of real-time safety alignment that is low-overhead, high-fidelity, and continuously evolvable through online adaptation.
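The feedback loop described above, an auditing module whose decision parameters are nudged online by detection outcomes, can be sketched in a few lines. This is a minimal illustrative toy, not the paper's actual method: the keyword-based score, the single threshold parameter, and the update rule are all assumptions made for the sake of the example.

```python
# Toy sketch of a feedback-driven, self-tuning moderator (assumptions:
# the scoring function, threshold parameter, and learning rate are
# illustrative, not taken from the paper).

SUSPICIOUS_MARKERS = ("ignore previous", "pretend you are", "system override")

def audit_score(prompt: str) -> float:
    """Toy adversarial-likelihood score: fraction of markers present."""
    p = prompt.lower()
    hits = sum(marker in p for marker in SUSPICIOUS_MARKERS)
    return hits / len(SUSPICIOUS_MARKERS)

class SelfTuningModerator:
    def __init__(self, threshold: float = 0.3, lr: float = 0.05):
        self.threshold = threshold  # decision boundary, tuned online
        self.lr = lr                # feedback step size

    def flag(self, prompt: str) -> bool:
        """Audit a prompt against the current threshold."""
        return audit_score(prompt) >= self.threshold

    def feedback(self, prompt: str, was_jailbreak: bool) -> None:
        """Online update: lower the threshold after a missed jailbreak,
        raise it after a false positive on a benign prompt."""
        flagged = self.flag(prompt)
        if was_jailbreak and not flagged:
            self.threshold = max(0.0, self.threshold - self.lr)
        elif not was_jailbreak and flagged:
            self.threshold = min(1.0, self.threshold + self.lr)
```

Because each update touches only a scalar parameter rather than model weights, adaptation is cheap enough to run per request, which is the intuition behind the millisecond-scale policy updates claimed in the summary.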
📝 Abstract
Ensuring LLM alignment is critical to information security as AI models become increasingly widespread and integrated into society. Unfortunately, many defenses against adversarial attacks and jailbreaking on LLMs cannot adapt quickly to new attacks, degrade model responses to benign prompts, or introduce significant barriers to scalable deployment. To mitigate these challenges, we introduce a real-time, self-tuning (RTST) moderator framework that defends against adversarial attacks while maintaining a lightweight training footprint. We empirically evaluate its effectiveness on Google's Gemini models against modern, effective jailbreaks. Our results demonstrate the advantages of an adaptive, minimally intrusive framework for jailbreak defense over traditional fine-tuning or classifier-based approaches.