New Wide-Net-Casting Jailbreak Attacks Risk Large Models

📅 2026-05-16
📈 Citations: 0
Influential: 0

🤖 AI Summary
This study addresses the underexplored jailbreak risk of wide-net casting, in which an adversary queries a group of large models rather than a single one, and it formally defines and systematically analyzes this scenario for the first time. The work proposes a jailbreak method based on multi-model collaborative querying, which crafts tailored prompts to exploit discrepancies among the models' responses and substantially increases the rate at which harmful content is elicited. In experiments on widely used large models without additional safeguards, the method reaches a jailbreak success rate of up to 100% in some configurations, indicating that wide-net casting is a distinct, high-risk scenario that single-model attack evaluations do not capture.
📝 Abstract
Jailbreak attacks on large models have drawn growing attention due to their close ties to societal safety. This work identifies a practical yet unexplored jailbreak scenario, the wide-net-casting scenario, where an adversary can query a group of large models instead of a single one to elicit harmful outputs. Our analysis reveals substantial yet previously overlooked safety risks under this scenario. As a key part of our analysis, we further develop a novel jailbreak method tailored to the wide-net-casting scenario. With this tailored method, the jailbreak success rate can even reach 100% in some experiments when targeting the large models without additional safeguards, exposing wide-net-casting as a distinct, high-risk scenario that warrants attention in future evaluation and defense research.
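To make the intuition behind the group setting concrete, the sketch below computes the chance that at least one model in a group produces a harmful output, given per-model success rates. It is not the paper's method or its metric: the function name `group_success_rate`, the independence assumption across models, and the example rates are all hypothetical and chosen only for illustration.

```python
def group_success_rate(per_model_rates):
    """Chance that at least one model in the group is jailbroken.

    ASSUMPTION (not from the paper): each model's outcome is independent
    of the others, so the chance that every model refuses is the product
    of the individual refusal probabilities.
    """
    p_all_refuse = 1.0
    for p in per_model_rates:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each rate must lie in [0, 1]")
        p_all_refuse *= 1.0 - p
    return 1.0 - p_all_refuse


# Hypothetical per-model rates, for illustration only.
single = group_success_rate([0.5])
pair = group_success_rate([0.5, 0.5])
print(f"one model: {single:.2f}, two models: {pair:.2f}")
```

Under the independence assumption, the group rate can never be lower than the best single-model rate, which is one way to see why evaluating models one at a time can understate the risk in the wide-net-casting scenario.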
Problem

Research questions and friction points this paper is trying to address.

jailbreak attacks
large language models
wide-net-casting
safety risks
adversarial querying
Innovation

Methods, ideas, or system contributions that make the work stand out.

wide-net-casting
jailbreak attacks
large language models
safety evaluation
adversarial querying