🤖 AI Summary
This work addresses a practical security vulnerability of large language model (LLM) agents: persistent, multi-turn harassment attacks in interactive web applications. The authors introduce the first benchmark designed specifically to evaluate LLM robustness against multi-turn harassment. Methodologically, they combine multi-agent simulation with repeated-game-theoretic modeling to develop three jailbreak attacks targeting agent memory, planning, and fine-tuning. These attacks systematically expose how LLMs are progressively induced, across successive dialogue turns, to generate toxic outputs such as insults and flaming. Evaluated on LLaMA-3.1-8B-Instruct and Gemini-2.0-flash using synthetic data and a mixed-methods assessment framework, the attacks achieve attack success rates above 95%, refusal rates of 1--2%, and substantial increases in toxicity. Notably, the work uncovers previously unreported vulnerabilities of closed-source models under multi-turn interaction, providing both theoretical grounding and empirical evidence for building dynamic, adaptive safety defenses.
📝 Abstract
Large Language Model (LLM) agents are powering a growing share of interactive web applications, yet remain vulnerable to misuse and harm. Prior jailbreak research has largely focused on single-turn prompts, whereas real harassment often unfolds over multi-turn interactions. In this work, we present the Online Harassment Agentic Benchmark consisting of: (i) a synthetic multi-turn harassment conversation dataset, (ii) a multi-agent (e.g., harasser, victim) simulation informed by repeated game theory, (iii) three jailbreak methods attacking agents across memory, planning, and fine-tuning, and (iv) a mixed-methods evaluation framework. We evaluate two prominent LLMs, LLaMA-3.1-8B-Instruct (open-source) and Gemini-2.0-flash (closed-source). Our results show that jailbreak tuning makes harassment nearly guaranteed, with an attack success rate of 95.78--96.89% vs. 57.25--64.19% without tuning in LLaMA, and 99.33% vs. 98.46% without tuning in Gemini, while sharply reducing refusal rates to 1--2% in both models. The most prevalent toxic behaviors are Insult with 84.9--87.8% vs. 44.2--50.8% without tuning, and Flaming with 81.2--85.1% vs. 31.5--38.8% without tuning, indicating weaker guardrails compared to sensitive categories such as sexual or racial harassment. Qualitative evaluation further reveals that attacked agents reproduce human-like aggression profiles, such as Machiavellian/psychopathic patterns under planning, and narcissistic tendencies with memory. Counterintuitively, closed-source and open-source models exhibit distinct escalation trajectories across turns, with closed-source models showing significant vulnerability. Overall, our findings show that multi-turn and theory-grounded attacks not only succeed at high rates but also mimic human-like harassment dynamics, motivating the development of robust safety guardrails to ultimately keep online platforms safe and responsible.
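As a rough illustration of how headline metrics like attack success rate (ASR) and refusal rate are typically computed from per-conversation outcomes, here is a minimal sketch. The label names (`success`, `refusal`, `other`) and the scoring scheme are our assumptions for illustration, not the paper's actual evaluation pipeline.

```python
# Illustrative sketch: percentage-based ASR and refusal rate over
# a set of attacked conversations, each judged with one outcome label.
from collections import Counter

def score_outcomes(outcomes):
    """outcomes: one label per attacked conversation:
        'success' - the agent produced harassing content
        'refusal' - the agent declined the request
        'other'   - e.g., an off-topic or benign reply
    Returns (attack_success_rate, refusal_rate) as percentages."""
    counts = Counter(outcomes)
    total = len(outcomes)
    asr = 100.0 * counts["success"] / total
    refusal_rate = 100.0 * counts["refusal"] / total
    return asr, refusal_rate

# Hypothetical batch of 100 judged conversations.
outcomes = ["success"] * 96 + ["refusal"] * 2 + ["other"] * 2
asr, refusal_rate = score_outcomes(outcomes)
# asr = 96.0, refusal_rate = 2.0
```

In practice the outcome labels would come from the paper's mixed-methods judging (automated toxicity scoring plus qualitative review), not a simple counter, but the rate computation reduces to this form.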