SafeBehavior: Simulating Human-Like Multistage Reasoning to Mitigate Jailbreak Attacks in Large Language Models

📅 2025-09-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Addressing the challenges of low detectability, poor generalization, and high computational overhead in jailbreaking attack detection against large language models (LLMs), this paper proposes a cognitive science–inspired hierarchical defense mechanism. The method emulates human multi-stage reasoning by establishing a three-phase safety assessment pipeline: *intent inference*, *self-reflection*, and *adaptive rewriting*—the first systematic integration of human decision-making principles into LLM security. Leveraging lightweight intent recognition, confidence-driven response rewriting, and multi-level safety judgment, it achieves robust defense with minimal computational cost. Evaluated on five representative jailbreaking attack categories, the approach significantly outperforms seven state-of-the-art baselines in attack blocking rate, while preserving user intent fidelity and output quality.

📝 Abstract
Large Language Models (LLMs) have achieved impressive performance across diverse natural language processing tasks, but their growing power also amplifies potential risks such as jailbreak attacks that circumvent built-in safety mechanisms. Existing defenses, including input paraphrasing, multi-step evaluation, and safety expert models, often suffer from high computational costs, limited generalization, or rigid workflows that fail to detect subtle malicious intent embedded in complex contexts. Inspired by cognitive science findings on human decision making, we propose SafeBehavior, a novel hierarchical jailbreak defense mechanism that simulates the adaptive multistage reasoning process of humans. SafeBehavior decomposes safety evaluation into three stages: intention inference to detect obvious input risks, self-introspection to assess generated responses and assign confidence-based judgments, and self-revision to adaptively rewrite uncertain outputs while preserving user intent and enforcing safety constraints. We extensively evaluate SafeBehavior against five representative jailbreak attack types, including optimization-based, contextual manipulation, and prompt-based attacks, and compare it with seven state-of-the-art defense baselines. Experimental results show that SafeBehavior significantly improves robustness and adaptability across diverse threat scenarios, offering an efficient and human-inspired approach to safeguarding LLMs against jailbreak attempts.
Problem

Research questions and friction points this paper is trying to address.

Simulating human multistage reasoning to mitigate jailbreak attacks in LLMs
Addressing computational cost and generalization limitations in existing defenses
Detecting subtle malicious intent through hierarchical safety evaluation stages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Simulates human multistage reasoning for safety
Decomposes evaluation into intention inference and introspection
Adaptively revises uncertain outputs while preserving intent
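The three-stage pipeline described above (intention inference → self-introspection → self-revision) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the keyword heuristics, confidence thresholds, and function names are all assumptions made for demonstration, whereas the actual method uses the LLM itself for each judgment stage.

```python
# Illustrative sketch of a SafeBehavior-style three-stage defense.
# All heuristics and thresholds below are assumptions for demonstration;
# the paper's method uses LLM-based judgments at each stage.

RISKY_TERMS = {"bomb", "malware", "exploit"}  # toy keyword list (assumption)

def intention_inference(prompt: str) -> bool:
    """Stage 1: flag obviously risky inputs before any generation."""
    return any(term in prompt.lower() for term in RISKY_TERMS)

def self_introspection(response: str) -> float:
    """Stage 2: assign a safety confidence score in [0, 1] to the response.
    A real system would query the model itself; this is a toy heuristic."""
    hits = sum(term in response.lower() for term in RISKY_TERMS)
    return max(0.0, 1.0 - 0.5 * hits)

def self_revision(response: str) -> str:
    """Stage 3: adaptively rewrite an uncertain output, preserving intent."""
    return "[revised for safety] " + response

def safe_behavior(prompt: str, generate) -> str:
    """Run the hierarchical pipeline around a generation function."""
    if intention_inference(prompt):
        return "Request refused: the input appears unsafe."
    response = generate(prompt)
    confidence = self_introspection(response)
    if confidence >= 0.9:           # confident-safe: pass through unchanged
        return response
    if confidence <= 0.3:           # confident-unsafe: block entirely
        return "Response withheld: low safety confidence."
    return self_revision(response)  # uncertain band: adaptive rewrite
```

The key design point mirrored here is that only the uncertain middle band triggers the (more expensive) rewriting stage, which is how the hierarchy keeps computational overhead low for clearly safe or clearly unsafe cases.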
Qinjian Zhao
Kean University
Jiaqi Wang
University of Bremen, Bibliothekstraße 1, Bremen, 28359, Bremen, Germany
Zhiqiang Gao
Wenzhou-Kean University, 88 Daxue Rd, Ouhai, Wenzhou, 325006, Zhejiang, China
Zhihao Dou
Case Western Reserve University, 10900 Euclid Avenue, Cleveland, 44106, Ohio, USA
Belal Abuhaija
Wenzhou-Kean University, 88 Daxue Rd, Ouhai, Wenzhou, 325006, Zhejiang, China
Kaizhu Huang
Professor, Duke Kunshan University
Generalization & Robustness · Statistical Learning Theory · Trustworthy AI