Strategic Deflection: Defending LLMs from Logit Manipulation

📅 2025-07-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Logit-level jailbreaking attacks, which directly manipulate the token-selection process during generation to evade refusal-based defenses, pose a critical security threat to large language models (LLMs). To address this, we propose an active defense framework centered on strategic content redirection: instead of defaulting to refusal, the framework performs real-time logit-layer monitoring and semantics-aware regulation to dynamically steer model outputs toward semantically proximal yet harmless responses. Our approach integrates adversarial decoding with lightweight semantic calibration and requires neither model fine-tuning nor architectural modification. Extensive experiments demonstrate that it significantly reduces attack success rates across diverse strong jailbreaking methods (an average reduction of 72.4%) while preserving original task performance with negligible accuracy degradation (<0.5%), achieving a robust trade-off between security and usability.
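To make the threat model and the redirection idea concrete, here is a minimal, hypothetical sketch in plain Python: a toy vocabulary, a "logit manipulation" attack that suppresses refusal tokens, and a deflection-style check that redirects probability mass toward harmless, on-topic tokens instead of restoring the refusal. The token sets, thresholds, and logit values are all illustrative assumptions, not the paper's implementation.

```python
import math

# Toy vocabulary; REFUSAL and DEFLECT token sets are illustrative assumptions.
REFUSAL = {"I", "cannot", "help"}      # tokens a refusal would start with
DEFLECT = {"safety", "guide"}          # semantically adjacent, harmless tokens

def softmax(logits):
    """Numerically stable softmax over a {token: logit} dict."""
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

def attack(logits):
    """Logit manipulation: suppress refusal tokens so greedy decoding
    can no longer select them."""
    return {t: (v - 1e9 if t in REFUSAL else v) for t, v in logits.items()}

def deflect(logits):
    """Defense sketch: if refusal probability mass has collapsed (a sign of
    tampering), boost harmless deflection tokens rather than refusing."""
    probs = softmax(logits)
    refusal_mass = sum(probs[t] for t in REFUSAL)
    if refusal_mass < 1e-3:  # hypothetical anomaly threshold
        return {t: (v + 10.0 if t in DEFLECT else v)
                for t, v in logits.items()}
    return logits

base = {"I": 2.0, "cannot": 1.5, "help": 1.2, "Sure,": 0.8,
        "here": 0.5, "safety": 0.3, "guide": 0.1}

attacked = attack(base)
print(max(attacked, key=attacked.get))   # "Sure," -> the attack succeeds
defended = deflect(attacked)
print(max(defended, key=defended.get))   # "safety" -> output is deflected
```

Note the design point this toy captures: the defense does not fight the attacker over the suppressed refusal tokens, which the attacker controls, but instead moves the decoding target to tokens the attacker has no incentive to suppress.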

📝 Abstract
With the growing adoption of Large Language Models (LLMs) in critical areas, ensuring their security against jailbreaking attacks is paramount. While traditional defenses primarily rely on refusing malicious prompts, recent logit-level attacks have demonstrated the ability to bypass these safeguards by directly manipulating the token-selection process during generation. We introduce Strategic Deflection (SDeflection), a defense that redefines the LLM's response to such advanced attacks. Instead of outright refusal, the model produces an answer that is semantically adjacent to the user's request yet strips away the harmful intent, thereby neutralizing the attack. Our experiments demonstrate that SDeflection significantly lowers Attack Success Rate (ASR) while maintaining model performance on benign queries. This work presents a critical shift in defensive strategies, moving from simple refusal to strategic content redirection to neutralize advanced threats.
Problem

Research questions and friction points this paper is trying to address.

Defending LLMs against logit manipulation attacks
Neutralizing harmful intent without outright refusal
Maintaining model performance on benign queries
Innovation

Methods, ideas, or system contributions that make the work stand out.

Strategic Deflection neutralizes logit manipulation attacks
Semantically adjacent answers strip harmful intent
Maintains performance while lowering Attack Success Rate
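The evaluation metric named above, Attack Success Rate (ASR), is simply the fraction of attack attempts that elicit harmful output; a defense is judged by how much it reduces that fraction relative to the undefended baseline. A minimal sketch with placeholder counts (not figures from the paper):

```python
def asr(successes, attempts):
    """Attack Success Rate: fraction of jailbreak attempts that succeed."""
    return successes / attempts

# Placeholder counts for illustration only, not results from the paper.
baseline = asr(60, 100)    # undefended model
defended = asr(15, 100)    # with a deflection-style defense in place

# Relative reduction in ASR, the headline number such papers report.
reduction = (baseline - defended) / baseline
print(f"{reduction:.1%}")  # → "75.0%"
```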
Yassine Rachidy
International Artificial Intelligence Center of Morocco, Mohammed VI Polytechnic University, Rabat, Morocco
Jihad Rbaiti
International Artificial Intelligence Center of Morocco, Mohammed VI Polytechnic University, Rabat, Morocco
Youssef Hmamouche
International Artificial Intelligence Center of Morocco, Mohammed VI Polytechnic University, Rabat, Morocco
Faissal Sehbaoui
AgriEdge, Mohammed VI Polytechnic University, Ben Guerir, Morocco
Amal El Fallah Seghrouchni
Full professor
Artificial Intelligence · Autonomous agents · Multi-Agent Systems · Ambient intelligence