HoneyTrap: Deceiving Large Language Model Attackers to Honeypot Traps with Resilient Multi-Agent Defense

📅 2026-01-07
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing passive defenses against multi-round jailbreak attacks by proposing the first proactive deception-based defense framework for large language model (LLM) security, inspired by honeypot principles. The framework employs a four-agent collaborative architecture—comprising a Threat Interceptor, Misdirection Controller, Forensic Tracker, and System Harmonizer—to lure attackers into traps through strategic misdirection, while integrating adversarial interaction and resource consumption mechanisms to enhance defensive efficacy. Key contributions include the construction of MTJ-Pro, a multi-turn progressive jailbreak dataset, and the introduction of two novel evaluation metrics: Mislead Success Rate (MSR) and Attack Resource Consumption (ARC). Experimental results demonstrate that the proposed method reduces attack success rates by 68.77% on average across mainstream LLMs, while increasing MSR and ARC by 118.11% and 149.16%, respectively, maintaining strong robustness even under adaptive attacks.

📝 Abstract
Jailbreak attacks pose significant threats to large language models (LLMs), enabling attackers to bypass safeguards. However, existing reactive defense approaches struggle to keep up with rapidly evolving multi-turn jailbreaks, where attackers continuously deepen their attacks to exploit vulnerabilities. To address this critical challenge, we propose HoneyTrap, a novel deceptive LLM defense framework leveraging collaborative defenders to counter jailbreak attacks. It integrates four defensive agents, Threat Interceptor, Misdirection Controller, Forensic Tracker, and System Harmonizer, each performing a specialized security role and collaborating to mount a deceptive defense. To ensure a comprehensive evaluation, we introduce MTJ-Pro, a challenging multi-turn progressive jailbreak dataset that combines seven advanced jailbreak strategies designed to gradually deepen attacks across multiple turns. In addition, we present two novel metrics: Mislead Success Rate (MSR) and Attack Resource Consumption (ARC), which provide more nuanced assessments of deceptive defense beyond conventional measures. Experimental results on GPT-4, GPT-3.5-turbo, Gemini-1.5-pro, and LLaMa-3.1 demonstrate that HoneyTrap achieves an average reduction of 68.77% in attack success rates compared to state-of-the-art baselines. Notably, even in a dedicated adaptive attacker setting with intensified conditions, HoneyTrap remains resilient, leveraging deceptive engagement to prolong interactions and significantly increase the time and computational costs required for successful exploitation. Unlike simple rejection, HoneyTrap strategically wastes attacker resources without impacting benign queries, improving MSR and ARC by 118.11% and 149.16%, respectively.
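The abstract reports improvements on two bespoke metrics, MSR and ARC, without giving their formulas on this page. A minimal Python sketch of one plausible reading follows; the session fields, the turns-plus-tokens cost, and the aggregation are illustrative assumptions, not the paper's actual definitions:

```python
# Hypothetical sketch of the two evaluation metrics named in the abstract:
# Mislead Success Rate (MSR) and Attack Resource Consumption (ARC).
# All field names and formulas below are assumptions for illustration only.
from dataclasses import dataclass

@dataclass
class AttackSession:
    misled: bool          # attacker was lured into the honeypot trap
    turns: int            # dialogue turns the attacker spent
    attacker_tokens: int  # tokens the attacker consumed

def mislead_success_rate(sessions):
    """Fraction of attack sessions in which the attacker was misled."""
    return sum(s.misled for s in sessions) / len(sessions)

def attack_resource_consumption(sessions):
    """Mean attacker cost per session (turns plus tokens, illustrative)."""
    return sum(s.turns + s.attacker_tokens for s in sessions) / len(sessions)

sessions = [
    AttackSession(misled=True, turns=8, attacker_tokens=1200),
    AttackSession(misled=False, turns=3, attacker_tokens=400),
]
print(mislead_success_rate(sessions))         # 0.5
print(attack_resource_consumption(sessions))  # 805.5
```

Under this reading, a deceptive defense raises MSR by trapping more sessions and raises ARC by stretching each session out, which matches the abstract's claim of prolonging interactions rather than simply rejecting.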
Problem

Research questions and friction points this paper is trying to address.

jailbreak attacks
large language models
multi-turn attacks
LLM security
adaptive attackers

Innovation

Methods, ideas, or system contributions that make the work stand out.

deceptive defense
multi-agent LLM security
jailbreak mitigation
honeypot for LLMs
adaptive attacker resilience
Siyuan Li
Shanghai Jiao Tong University
Trustworthy LLM Agents, Edge Intelligence
Xi Lin
Shanghai Jiao Tong University
Jun Wu
Shanghai Jiao Tong University
Zehao Liu
Shanghai Jiao Tong University
Haoyu Li
Student, UIUC
Machine Learning
Tianjie Ju
Shanghai Jiao Tong University
Natural Language Processing
Xiang Chen
Zhejiang University
Jianhua Li
Shanghai Jiao Tong University