Feature-Aware Malicious Output Detection and Mitigation

📅 2025-04-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) are vulnerable to jailbreaking attacks, leading to harmful outputs; existing RLHF-based fine-tuning struggles to reliably detect latent malicious intent. To address this, we propose Feature-aware Malicious-response Mitigation (FMM), a real-time inference-time defense that monitors hidden-layer activation features and employs a lightweight discriminator to model toxicity-associated token representations, triggering an adaptive rejection mechanism. Crucially, FMM introduces a novel, fine-tuning-free activation patching technique: during decoding, it dynamically injects rejection vectors into the activation space to immediately correct toxic generations. Evaluated across over ten prevalent jailbreaking attack variants, FMM achieves an average defense success rate exceeding 92% on multiple mainstream LLMs, while preserving original task performance with negligible degradation—helpfulness drops by less than 0.3%.

📝 Abstract
The rapid advancement of large language models (LLMs) has brought significant benefits to various domains while introducing substantial risks. Despite being fine-tuned through reinforcement learning, LLMs lack the capability to discern malicious content, limiting their defense against jailbreak attacks. To address these safety concerns, we propose a feature-aware method for harmful response rejection (FMM), which detects the presence of malicious features within the model's feature space and adaptively adjusts the model's rejection mechanism. By employing a simple discriminator, we detect potential malicious traits during the decoding phase. Upon detecting features indicative of toxic tokens, FMM regenerates the current token. Through activation patching, an additional rejection vector is incorporated during subsequent token generation, steering the model towards a refusal response. Experimental results demonstrate the effectiveness of our approach across multiple language models and diverse attack techniques, while crucially maintaining the models' standard generation capabilities.
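The abstract describes a lightweight discriminator that inspects a token's hidden-layer activation during decoding and flags toxic features. A minimal sketch of such a probe, assuming a simple linear classifier over a single token's activation (the class `ToxicityProbe`, the function `is_toxic`, and the hidden dimension are illustrative assumptions, not the paper's actual architecture):

```python
import torch
import torch.nn as nn


class ToxicityProbe(nn.Module):
    """Hypothetical linear probe over one token's hidden-layer activation."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        # Two logits: index 0 = benign, index 1 = toxic
        self.classifier = nn.Linear(hidden_dim, 2)

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        # hidden_state: (batch, hidden_dim), the activation of the current token
        return self.classifier(hidden_state)


def is_toxic(probe: ToxicityProbe, hidden_state: torch.Tensor,
             threshold: float = 0.5) -> bool:
    """Return True if the probe's toxic-class probability exceeds the threshold."""
    probs = torch.softmax(probe(hidden_state), dim=-1)
    return bool(probs[0, 1] > threshold)


# Demo with random weights and a random activation (no trained model involved)
torch.manual_seed(0)
probe = ToxicityProbe(hidden_dim=16)
h = torch.randn(1, 16)
print(is_toxic(probe, h))
```

In a real deployment the probe would be trained on activations labeled benign/toxic and called once per decoded token; the summary notes this check is fine-tuning-free with respect to the base LLM.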
Problem

Research questions and friction points this paper is trying to address.

Detects malicious features in LLM outputs to prevent harmful content
Adaptively adjusts rejection mechanisms to counter jailbreak attacks
Maintains standard generation capabilities while enhancing safety
Innovation

Methods, ideas, or system contributions that make the work stand out.

Feature-aware method detects malicious content
Adaptive rejection mechanism adjusts response
Activation patching steers refusal response
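The activation-patching step can be sketched as a forward hook that adds a scaled "rejection" direction to the current token's hidden state, steering subsequent generation toward a refusal. This is a minimal illustration under stated assumptions: the hook shape, the name `make_rejection_hook`, and the scaling factor `alpha` are hypothetical, and a real rejection vector would be derived from refusal-response activations rather than sampled at random.

```python
import torch


def patch_with_rejection(hidden: torch.Tensor, rejection_vec: torch.Tensor,
                         alpha: float = 1.0) -> torch.Tensor:
    """Add a scaled rejection direction to an activation (hypothetical)."""
    return hidden + alpha * rejection_vec


def make_rejection_hook(rejection_vec: torch.Tensor, alpha: float = 1.0):
    """Build a hook that patches the last token's hidden state.

    The returned callable matches the (module, inputs, output) signature of
    a PyTorch forward hook, so it could be registered on a transformer layer.
    """
    def hook(module, inputs, output):
        # output: hidden states of shape (batch, seq_len, hidden_dim);
        # only the most recent token is patched, earlier tokens are untouched.
        patched = output.clone()
        patched[:, -1, :] = patch_with_rejection(output[:, -1, :],
                                                 rejection_vec, alpha)
        return patched
    return hook


# Demo on a random activation tensor (stand-in for a layer's output)
torch.manual_seed(0)
h = torch.randn(2, 5, 8)          # (batch=2, seq_len=5, hidden_dim=8)
v = torch.randn(8)                # stand-in rejection direction
out = make_rejection_hook(v, alpha=0.5)(None, None, h)
assert torch.allclose(out[:, :-1], h[:, :-1])  # earlier tokens unchanged
```

Injecting the vector only at decode time is what makes the approach fine-tuning-free: the base model's weights are never updated, only its activations are shifted when the discriminator fires.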
Authors
Weilong Dong, Tianjin University
Peiguang Li, Meituan Group (Natural Language Processing)
Yu Tian, Dept. of Computer Science and Technology, Institute for AI, Tsinghua University
Xinyi Zeng, Sichuan University (Medical Image Segmentation, Medical Image Reconstruction, Multi-modal Learning)
Fengdi Li, Faculty of Information Technology, Monash University
Sirui Wang, Meituan (NLP, LLM)