TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning

📅 2026-04-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the detection of decompositional jailbreak attacks in fully anonymous, metadata-free settings, where adversaries split a malicious objective into multiple seemingly benign queries to evade detection. The authors propose TwinGate, a stateful dual-encoder defense framework that tracks state without relying on user identifiers or request linkage. TwinGate uses asymmetric contrastive learning to cluster semantically dispersed yet intent-consistent malicious query fragments in a shared latent space, while a parallel frozen encoder suppresses false positives caused by overlapping benign topics. Evaluated on a large-scale dataset of more than 3.62 million instructions spanning 8,600 distinct malicious intents, TwinGate is reported to outperform existing stateful and stateless defenses, achieving high recall, a very low false-positive rate, robustness against adaptive attacks, and high throughput at negligible latency overhead.

📝 Abstract

Decompositional jailbreaks pose a critical threat to large language models (LLMs) by allowing adversaries to fragment a malicious objective into a sequence of individually benign queries that collectively reconstruct prohibited content. In real-world deployments, LLMs face a continuous, untraceable stream of fully anonymized and arbitrarily interleaved requests, infiltrated by covertly distributed adversarial queries. Under this rigorous threat model, state-of-the-art defensive strategies exhibit fundamental limitations. In the absence of trustworthy user metadata, they are incapable of tracking global historical contexts, while their deployment of generative models for real-time monitoring introduces computationally prohibitive overhead. To address this, we present TwinGate, a stateful dual-encoder defense framework. TwinGate employs Asymmetric Contrastive Learning (ACL) to cluster semantically disparate but intent-matched malicious fragments in a shared latent space, while a parallel frozen encoder suppresses false positives arising from benign topical overlap. Each request requires only a single lightweight forward pass, enabling the defense to execute in parallel with the target model's prefill phase at negligible latency overhead. To evaluate our approach and advance future research, we construct a comprehensive dataset of over 3.62 million instructions spanning 8,600 distinct malicious intents. Evaluated on this large-scale corpus under a strictly causal protocol, TwinGate achieves high malicious intent recall at a remarkably low false positive rate while remaining highly robust against adaptive attacks. Furthermore, our proposal substantially outperforms stateful and stateless baselines, delivering superior throughput and reduced latency.
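
To make the dual-encoder idea in the abstract more concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes each request has already been embedded twice: once by a trained "intent" encoder and once by a frozen "topic" encoder. The class name `ToyGate`, the thresholds, the sliding window, and the decision rule (flag a request when enough earlier anonymous requests are close in intent space but not close in topic space) are all hypothetical choices made for illustration; the abstract does not specify how the two encoders are combined.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyGate:
    """Hypothetical stateful gate over an anonymous request stream.

    State is a bounded window of past embedding pairs; no user IDs or
    session links are stored. All parameters are illustrative assumptions.
    """

    def __init__(self, window=1000, intent_thresh=0.8, topic_thresh=0.8, min_hits=2):
        self.history = deque(maxlen=window)  # (intent_vec, topic_vec) pairs
        self.intent_thresh = intent_thresh
        self.topic_thresh = topic_thresh
        self.min_hits = min_hits

    def check(self, intent_vec, topic_vec):
        """Return True if the request should be flagged, then record it."""
        hits = 0
        for past_intent, past_topic in self.history:
            close_intent = cosine(intent_vec, past_intent) >= self.intent_thresh
            close_topic = cosine(topic_vec, past_topic) >= self.topic_thresh
            # Assumed rule: count only matches that share intent but are
            # topically dispersed; plain topical overlap is not counted.
            if close_intent and not close_topic:
                hits += 1
        self.history.append((intent_vec, topic_vec))
        return hits >= self.min_hits


# Usage: three fragments with matching intent vectors but unrelated topics.
gate = ToyGate(min_hits=2)
flags = [
    gate.check([1.0, 0.0], [1.0, 0.0, 0.0]),
    gate.check([0.9, 0.1], [0.0, 1.0, 0.0]),
    gate.check([0.95, 0.05], [0.0, 0.0, 1.0]),
]
```

In this toy run only the third fragment is flagged, because by then two earlier requests match it in the intent space while differing in the topic space. A real system would presumably replace the linear scan with an approximate nearest-neighbor index and learn the intent encoder with the contrastive objective the paper describes; both are outside what this sketch shows.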

Problem

Research questions and friction points this paper is trying to address.

decompositional jailbreaks

untraceable traffic

large language models

adversarial queries

stateful defense

Innovation

Methods, ideas, or system contributions that make the work stand out.

Asymmetric Contrastive Learning

Decompositional Jailbreaks

Stateful Defense