HoneyTrap: Deceiving Large Language Model Attackers to Honeypot Traps with Resilient Multi-Agent Defense

📅 2026-01-07
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing passive defenses against multi-round jailbreak attacks by proposing the first proactive deception-based defense framework for large language model (LLM) security, inspired by honeypot principles. The framework employs a four-agent collaborative architecture—comprising a Threat Interceptor, Misdirection Controller, Forensic Tracker, and System Harmonizer—to lure attackers into traps through strategic misdirection, while integrating adversarial interaction and resource consumption mechanisms to enhance defensive efficacy. Key contributions include the construction of MTJ-Pro, a multi-turn progressive jailbreak dataset, and the introduction of two novel evaluation metrics: Mislead Success Rate (MSR) and Attack Resource Consumption (ARC). Experimental results demonstrate that the proposed method reduces attack success rates by 68.77% on average across mainstream LLMs, while increasing MSR and ARC by 118.11% and 149.16%, respectively, maintaining strong robustness even under adaptive attacks.

📝 Abstract
Jailbreak attacks pose significant threats to large language models (LLMs), enabling attackers to bypass safeguards. However, existing reactive defense approaches struggle to keep up with rapidly evolving multi-turn jailbreaks, where attackers continuously deepen their attacks to exploit vulnerabilities. To address this critical challenge, we propose HoneyTrap, a novel deceptive LLM defense framework leveraging collaborative defenders to counter jailbreak attacks. It integrates four defensive agents, Threat Interceptor, Misdirection Controller, Forensic Tracker, and System Harmonizer, each performing a specialized security role and collaborating to mount a deceptive defense. To ensure a comprehensive evaluation, we introduce MTJ-Pro, a challenging multi-turn progressive jailbreak dataset that combines seven advanced jailbreak strategies designed to gradually deepen attacks across multiple turns. In addition, we present two novel metrics: Mislead Success Rate (MSR) and Attack Resource Consumption (ARC), which provide more nuanced assessments of deceptive defense beyond conventional measures. Experimental results on GPT-4, GPT-3.5-turbo, Gemini-1.5-pro, and LLaMa-3.1 demonstrate that HoneyTrap achieves an average reduction of 68.77% in attack success rates compared to state-of-the-art baselines. Notably, even in a dedicated adaptive attacker setting with intensified conditions, HoneyTrap remains resilient, leveraging deceptive engagement to prolong interactions and significantly increase the time and computational costs required for successful exploitation. Unlike simple rejection, HoneyTrap strategically wastes attacker resources without impacting benign queries, improving MSR and ARC by 118.11% and 149.16%, respectively.
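The abstract reports improvements on two bespoke metrics, MSR and ARC, without giving their formulas on this page. A minimal Python sketch of one plausible reading follows; the session fields, the turns-plus-tokens cost, and the aggregation are illustrative assumptions, not the paper's actual definitions:

```python
# Hypothetical sketch of the two evaluation metrics named in the abstract:
# Mislead Success Rate (MSR) and Attack Resource Consumption (ARC).
# All field names and formulas below are assumptions for illustration only.
from dataclasses import dataclass

@dataclass
class AttackSession:
    misled: bool          # attacker was lured into the honeypot trap
    turns: int            # dialogue turns the attacker spent
    attacker_tokens: int  # tokens the attacker consumed

def mislead_success_rate(sessions):
    """Fraction of attack sessions in which the attacker was misled."""
    return sum(s.misled for s in sessions) / len(sessions)

def attack_resource_consumption(sessions):
    """Mean attacker cost per session (turns plus tokens, illustrative)."""
    return sum(s.turns + s.attacker_tokens for s in sessions) / len(sessions)

sessions = [
    AttackSession(misled=True, turns=8, attacker_tokens=1200),
    AttackSession(misled=False, turns=3, attacker_tokens=400),
]
print(mislead_success_rate(sessions))         # 0.5
print(attack_resource_consumption(sessions))  # 805.5
```

Under this reading, a deceptive defense raises MSR by trapping more sessions and raises ARC by stretching each session out, which matches the abstract's claim of prolonging interactions rather than simply rejecting.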
Problem

Research questions and friction points this paper is trying to address.

jailbreak attacks
large language models
multi-turn attacks
LLM security
adaptive attackers

Innovation

Methods, ideas, or system contributions that make the work stand out.

deceptive defense
multi-agent LLM security
jailbreak mitigation
honeypot for LLMs
adaptive attacker resilience
Siyuan Li
Shanghai Jiao Tong University
Trustworthy LLM Agents, Edge Intelligence
Xi Lin
Shanghai Jiao Tong University
Jun Wu
Shanghai Jiao Tong University
Zehao Liu
Shanghai Jiao Tong University
Haoyu Li
Student, UIUC
Machine Learning
Tianjie Ju
Shanghai Jiao Tong University
Natural Language Processing
Xiang Chen
Zhejiang University
Jianhua Li
Shanghai Jiao Tong University