Quant Fever, Reasoning Blackholes, Schrodinger's Compliance, and More: Probing GPT-OSS-20B

📅 2025-09-28

🤖 AI Summary
This work evaluates the security vulnerabilities of the open-weight large language model GPT-OSS-20B under adversarial conditions. Method: The study uses the Jailbreak Oracle (JO), a systematic LLM evaluation tool, to probe the model's behavior. Contribution/Results: It uncovers several failure modes, including quant fever, reasoning blackholes, Schrodinger's compliance, reasoning procedure mirage, and chain-oriented prompting. Experiments show that these behaviors can be exploited on GPT-OSS-20B, leading to severe consequences.

📝 Abstract
OpenAI's GPT-OSS family provides open-weight language models with explicit chain-of-thought (CoT) reasoning and a Harmony prompt format. We summarize an extensive security evaluation of GPT-OSS-20B that probes the model's behavior under different adversarial conditions. Using the Jailbreak Oracle (JO) [1], a systematic LLM evaluation tool, the study uncovers several failure modes including quant fever, reasoning blackholes, Schrodinger's compliance, reasoning procedure mirage, and chain-oriented prompting. Experiments demonstrate how these behaviors can be exploited on GPT-OSS-20B models, leading to severe consequences.
Problem

Research questions and friction points this paper is trying to address.

Evaluating GPT-OSS-20B's security vulnerabilities under adversarial attacks
Identifying failure modes like quant fever and reasoning blackholes
Demonstrating how these failure modes can be exploited on GPT-OSS-20B
Innovation

Methods, ideas, or system contributions that make the work stand out.

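Applies the Jailbreak Oracle (JO), a systematic LLM evaluation tool
Probes GPT-OSS-20B's behavior under different adversarial conditions
Uncovers failure modes including Schrodinger's compliance, reasoning procedure mirage, and chain-oriented prompting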