🤖 AI Summary
This work addresses a practical security vulnerability of large language model (LLM) agents: persistent, multi-turn harassment attacks in interactive web applications. The authors introduce the first benchmark designed specifically to evaluate LLM robustness against multi-turn harassment. Methodologically, they combine multi-agent simulation with repeated-game-theoretic modeling to develop three jailbreak attacks targeting agent memory, planning, and fine-tuning. These attacks systematically expose how LLMs are progressively induced, across successive dialogue turns, to generate toxic outputs such as insults and flaming. Evaluated on LLaMA-3.1-8B-Instruct and Gemini-2.0-flash using synthetic data and a mixed-methods assessment framework, the attacks achieve attack success rates above 95%, refusal rates of 1--2%, and substantial increases in toxicity. Notably, the work uncovers previously unreported vulnerabilities of closed-source models under multi-turn interaction, providing both theoretical grounding and empirical evidence for building dynamic, adaptive safety defenses.
📝 Abstract
Large Language Model (LLM) agents are powering a growing share of interactive web applications, yet remain vulnerable to misuse and harm. Prior jailbreak research has largely focused on single-turn prompts, whereas real harassment often unfolds over multi-turn interactions. In this work, we present the Online Harassment Agentic Benchmark consisting of: (i) a synthetic multi-turn harassment conversation dataset, (ii) a multi-agent (e.g., harasser, victim) simulation informed by repeated game theory, (iii) three jailbreak methods attacking agents across memory, planning, and fine-tuning, and (iv) a mixed-methods evaluation framework. We evaluate two prominent LLMs, LLaMA-3.1-8B-Instruct (open-source) and Gemini-2.0-flash (closed-source). Our results show that jailbreak tuning makes harassment nearly guaranteed, with an attack success rate of 95.78--96.89% vs. 57.25--64.19% without tuning in LLaMA, and 99.33% vs. 98.46% without tuning in Gemini, while sharply reducing refusal rates to 1--2% in both models. The most prevalent toxic behaviors are Insult with 84.9--87.8% vs. 44.2--50.8% without tuning, and Flaming with 81.2--85.1% vs. 31.5--38.8% without tuning, indicating weaker guardrails compared to sensitive categories such as sexual or racial harassment. Qualitative evaluation further reveals that attacked agents reproduce human-like aggression profiles, such as Machiavellian/psychopathic patterns under planning, and narcissistic tendencies with memory. Counterintuitively, closed-source and open-source models exhibit distinct escalation trajectories across turns, with closed-source models showing significant vulnerability. Overall, our findings show that multi-turn and theory-grounded attacks not only succeed at high rates but also mimic human-like harassment dynamics, motivating the development of robust safety guardrails to ultimately keep online platforms safe and responsible.
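As a rough illustration of how headline metrics like attack success rate (ASR) and refusal rate are typically computed from per-conversation outcomes, here is a minimal sketch. The label names (`success`, `refusal`, `other`) and the scoring scheme are our assumptions for illustration, not the paper's actual evaluation pipeline.

```python
# Illustrative sketch: percentage-based ASR and refusal rate over
# a set of attacked conversations, each judged with one outcome label.
from collections import Counter

def score_outcomes(outcomes):
    """outcomes: one label per attacked conversation:
        'success' - the agent produced harassing content
        'refusal' - the agent declined the request
        'other'   - e.g., an off-topic or benign reply
    Returns (attack_success_rate, refusal_rate) as percentages."""
    counts = Counter(outcomes)
    total = len(outcomes)
    asr = 100.0 * counts["success"] / total
    refusal_rate = 100.0 * counts["refusal"] / total
    return asr, refusal_rate

# Hypothetical batch of 100 judged conversations.
outcomes = ["success"] * 96 + ["refusal"] * 2 + ["other"] * 2
asr, refusal_rate = score_outcomes(outcomes)
# asr = 96.0, refusal_rate = 2.0
```

In practice the outcome labels would come from the paper's mixed-methods judging (automated toxicity scoring plus qualitative review), not a simple counter, but the rate computation reduces to this form.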