Furina: Fragmented Uncertainty-Driven Refusal Instability Attack

📅 2026-05-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work challenges the notion that safety alignment in large language models operates as a binary threshold mechanism, revealing instead an unstable region where minor input perturbations trigger erratic refusal behaviors. The authors introduce a multi-metric diagnostic framework that uncovers a previously unobserved “uncertainty–activation decoupling” phenomenon within this instability zone: output uncertainty is high while internal safety activation is low. Leveraging this insight, they devise a fragmented, scene-anchored prompt attack strategy that requires no model-specific optimization. Evaluated on HarmBench, the method outperforms strong single-turn and multi-turn baselines, and it achieves competitive performance on MM-SafetyBench, which the authors take as evidence that the proposed mechanism is both valid and transferable.

📝 Abstract

Safety alignment in large language models (LLMs) and multimodal large language models (MLLMs) is commonly assumed to operate as a near-binary threshold mechanism. We challenge this assumption by revealing that safety behavior is governed by an instability region where small perturbations induce stochastic refusal decisions rather than deterministic outcomes. We develop a multi-metric diagnostic framework combining external and internal signals to characterize this instability. Through systematic experiments, we identify a characteristic diagnostic signature: inputs in unstable regimes exhibit elevated output uncertainty yet decreased internal safety activation, a decoupling phenomenon that explains why detection-based defenses fail against sophisticated attacks. Building on this framework, we introduce Furina, a jailbreak attack that deliberately induces this signature through fragmented, scene-anchored prompts without model-specific optimization. Furina outperforms strong single-turn and multi-turn baselines on HarmBench and achieves competitive results on MM-SafetyBench, demonstrating that uncertainty amplification provides a principled and transferable mechanism for understanding safety vulnerabilities. Code is available at: https://github.com/0xCavaliers/Furina_Jailbreak.
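
The notion of an instability region, where refusal decisions are stochastic rather than deterministic, can be made concrete with a small measurement sketch. This is not the authors' diagnostic framework, which the abstract does not specify. It is a minimal illustration that assumes refusal labels are already available for repeated samples of the same input, and the names `refusal_rate`, `binary_entropy`, and `instability_score` are hypothetical. A score near 0 means the model decides consistently. A score near 1 means the decision is close to a coin flip.

```python
import math
from typing import Sequence


def refusal_rate(refused: Sequence[bool]) -> float:
    """Fraction of sampled responses that were refusals."""
    if not refused:
        raise ValueError("need at least one sampled decision")
    return sum(refused) / len(refused)


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) variable; 0 at p=0 or p=1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def instability_score(refused: Sequence[bool]) -> float:
    """Illustrative instability measure (an assumption, not the paper's metric).

    `refused` holds one refuse/comply label per repeated sample of the same
    input, produced by some external judge that is outside this sketch.
    """
    return binary_entropy(refusal_rate(refused))


if __name__ == "__main__":
    stable = [True, True, True, True]        # always refuses
    unstable = [True, False, True, False]    # refusal is a coin flip
    print(instability_score(stable), instability_score(unstable))
```

Entropy of the refusal decision captures only the external side of the picture. The decoupling the abstract describes also involves an internal safety-activation signal, which this sketch does not model.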

Problem

Research questions and friction points this paper is trying to address.

safety alignment

refusal instability

uncertainty

jailbreak attack

large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

uncertainty-driven instability

safety alignment

jailbreak attack