🤖 AI Summary
This work addresses a vulnerability of language models in multi-turn dialogue: attackers can steer a model toward harmful behavior step by step through seemingly benign turns, evading existing safety mechanisms. To counter this threat, the authors propose TurnGate, a response-aware, turn-level defense. They first construct MTID, a multi-turn intent dataset annotated with attack trajectories, benign hard negatives, and the earliest harm-enabling turns. Trained on this data, TurnGate jointly analyzes the dialogue context and the candidate response, so it can intervene at the earliest turn at which delivering the response would enable harm. Experiments show that TurnGate substantially outperforms baseline methods at detecting concealed attacks while keeping over-refusal low, and that it generalizes across domains, attacker pipelines, and target models.
📝 Abstract
Hidden malicious intent in multi-turn dialogue poses a growing threat to deployed large language models (LLMs). Rather than exposing a harmful objective in a single prompt, increasingly capable attackers can distribute their intent across multiple benign-looking turns. Recent studies show that even modern commercial models remain vulnerable to such attacks despite advances in safety alignment and external guardrails. In this work, we address this challenge by detecting the earliest turn at which delivering the candidate response would make the accumulated interaction sufficient to enable harmful action. This objective requires precise turn-level intervention that identifies the harm-enabling closure point while avoiding premature refusal of benign exploratory conversations. To support training and evaluation, we construct the Multi-Turn Intent Dataset (MTID), which contains branching attack rollouts, matched benign hard negatives, and annotations of the earliest harm-enabling turns. We show that MTID enables a turn-level monitor, TurnGate, which substantially outperforms existing baselines in harmful-intent detection while maintaining low over-refusal rates. TurnGate further generalizes across domains, attacker pipelines, and target models. Our code is available at https://github.com/Graph-COM/TurnGate.
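The detection objective described in the abstract, finding the earliest turn at which delivering the candidate response would complete a harm-enabling interaction, can be illustrated with a minimal Python sketch. Everything named here is a hypothetical stand-in and not the TurnGate code: `Turn`, `earliest_closure_turn`, `toy_scorer`, the placeholder tokens `part_a`/`part_b`/`part_c`, and the threshold are assumptions made for illustration. The real monitor is a trained model; the toy scorer below only counts placeholder tokens.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Turn:
    user: str                # user message at this turn
    candidate_response: str  # response the target model would deliver


# Hypothetical scorer type: given the turns already delivered and the
# current turn (including its candidate response), return a risk in [0, 1].
Scorer = Callable[[List[Turn], Turn], float]


def earliest_closure_turn(
    dialogue: List[Turn], scorer: Scorer, threshold: float
) -> Optional[int]:
    """Return the index of the first turn whose candidate response would
    push the accumulated interaction over the risk threshold, else None."""
    delivered: List[Turn] = []
    for index, turn in enumerate(dialogue):
        # Response-aware check: score the candidate BEFORE delivering it.
        if scorer(delivered, turn) >= threshold:
            return index
        delivered.append(turn)  # judged safe so far, so it is delivered
    return None


def toy_scorer(delivered: List[Turn], turn: Turn) -> float:
    """Toy stand-in for a learned monitor (assumption): risk is the share
    of three placeholder pieces that the delivered responses plus the
    candidate response would jointly reveal."""
    pieces = ("part_a", "part_b", "part_c")
    revealed = " ".join(t.candidate_response for t in delivered)
    revealed += " " + turn.candidate_response
    return sum(piece in revealed for piece in pieces) / len(pieces)


# An attack that spreads the pieces over three benign-looking turns.
attack = [
    Turn("question 1", "answer containing part_a"),
    Turn("question 2", "answer containing part_b"),
    Turn("question 3", "answer containing part_c"),
]
# A benign exploratory dialogue that never completes the set.
benign = [
    Turn("question 1", "answer containing part_a"),
    Turn("question 2", "more detail on part_a"),
]

attack_closure = earliest_closure_turn(attack, toy_scorer, threshold=0.9)
benign_closure = earliest_closure_turn(benign, toy_scorer, threshold=0.9)
```

In this sketch the attack dialogue is stopped only at its final turn, where the third piece would complete the set, while the benign dialogue is never interrupted. That mirrors the trade-off the abstract describes: intervene at the closure point, not before it.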