Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents

📅 2026-02-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing evaluation methods struggle to assess the risk of harmful assistance from tool-augmented LLM agents on complex, illicit tasks in multilingual, multi-turn settings. To address this gap, this work proposes STING, a framework that constructs multi-step illicit plans grounded in benign personas and employs adaptive probing alongside judge agents that track execution progress, enabling multi-turn red-teaming. The approach models multi-turn jailbreaking as a "time-to-first-jailbreak" random variable and introduces new analysis tools, including discovery curves, language-wise hazard-ratio attribution, and a Restricted Mean Jailbreak Discovery metric. Experiments on AgentHarm show that STING substantially outperforms both single-turn and multi-turn baselines. Multilingual evaluations further reveal that low-resource languages are not universally more vulnerable, a finding that challenges common assumptions from chatbot red-teaming.

📝 Abstract
LLM-based agents execute real-world workflows via tools and memory. These affordances also enable adversaries to use such agents to carry out complex misuse scenarios. Existing agent misuse benchmarks largely test single-prompt instructions, leaving a gap in measuring how agents end up helping with harmful or illegal tasks over multiple turns. We introduce STING (Sequential Testing of Illicit N-step Goal execution), an automated red-teaming framework that constructs a step-by-step illicit plan grounded in a benign persona and iteratively probes a target agent with adaptive follow-ups, using judge agents to track phase completion. We further introduce an analysis framework that models multi-turn red-teaming as a time-to-first-jailbreak random variable, enabling analysis tools like discovery curves, hazard-ratio attribution by attack language, and a new metric: Restricted Mean Jailbreak Discovery. Across AgentHarm scenarios, STING yields substantially higher illicit-task completion than single-turn prompting and chat-oriented multi-turn baselines adapted to tool-using agents. In multilingual evaluations across six non-English settings, we find that attack success and illicit-task completion do not consistently increase in lower-resource languages, diverging from common chatbot findings. Overall, STING provides a practical way to evaluate and stress-test agent misuse in realistic deployment settings, where interactions are inherently multi-turn and often multilingual.
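The survival-analysis framing described in the abstract can be made concrete with a small sketch. This is an illustrative reconstruction, not the paper's code: it assumes each red-teaming conversation yields the turn index of the first observed jailbreak, or `None` if no jailbreak occurred within the turn budget (a right-censored run). The function names and data shapes are assumptions for illustration.

```python
def discovery_curve(first_jailbreak_turns, budget):
    """Empirical fraction of conversations jailbroken by each turn 1..budget.

    first_jailbreak_turns: list of turn indices (int) or None for
    conversations never jailbroken within the budget (censored).
    """
    n = len(first_jailbreak_turns)
    return [
        sum(1 for t in first_jailbreak_turns if t is not None and t <= turn) / n
        for turn in range(1, budget + 1)
    ]

def restricted_mean_discovery_time(first_jailbreak_turns, budget):
    """Mean number of turns until the first jailbreak, truncating censored
    runs at the budget (the analog of restricted mean survival time)."""
    total = sum(min(t, budget) if t is not None else budget
                for t in first_jailbreak_turns)
    return total / len(first_jailbreak_turns)

# Example: five conversations probed for up to 5 turns each;
# two never produce a jailbreak (None = censored at the budget).
turns = [2, 4, None, 1, None]
curve = discovery_curve(turns, budget=5)            # rises toward 0.6
rmdt = restricted_mean_discovery_time(turns, budget=5)  # 3.4 turns
```

Under this framing, a lower restricted mean discovery time indicates a target agent that is jailbroken earlier on average, and per-language discovery curves support the hazard-ratio comparisons the paper describes.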
Problem

Research questions and friction points this paper is trying to address.

illicit assistance
multi-turn interactions
LLM agents
agent misuse
multilingual evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-turn red-teaming
illicit assistance
LLM agents
multilingual safety evaluation
STING framework