Quant Fever, Reasoning Blackholes, Schrodinger's Compliance, and More: Probing GPT-OSS-20B

📅 2025-09-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work systematically evaluates the security vulnerabilities of the open-source large language model GPT-OSS-20B under adversarial conditions. Method: We introduce Jailbreak Oracle, a novel evaluation framework integrating adversarial prompt engineering with fine-grained behavioral probing to rigorously assess model robustness. Contribution/Results: We identify and formally characterize five previously unreported failure modes—Quantization Frenzy, Reasoning Black Hole, Schrödinger Compliance, Reasoning Mirage, and Chain-Directed Prompt Leakage. Empirical analysis reveals structural deficiencies in logical reasoning, value alignment, and instruction following. Multiple attack vectors consistently bypass safety guardrails, inducing severe jailbreaking and misleading outputs. Our findings establish a new analytical paradigm for LLM security mechanistic studies and provide a reproducible, open diagnostic toolkit for rigorous safety assessment.

Technology Category

Application Category

📝 Abstract
OpenAI's GPT-OSS family provides open-weight language models with explicit chain-of-thought (CoT) reasoning and a Harmony prompt format. We summarize an extensive security evaluation of GPT-OSS-20B that probes the model's behavior under different adversarial conditions. Us- ing the Jailbreak Oracle (JO) [1], a systematic LLM evaluation tool, the study uncovers several failure modes including quant fever, reasoning blackholes, Schrodinger's compliance, reasoning procedure mirage, and chain-oriented prompting. Experiments demonstrate how these behaviors can be exploited on GPT-OSS-20B models, leading to severe consequences.
Problem

Research questions and friction points this paper is trying to address.

Evaluating GPT-OSS-20B's security vulnerabilities under adversarial attacks
Identifying failure modes like quant fever and reasoning blackholes
Demonstrating exploitability of harmful behaviors in language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic LLM evaluation tool Jailbreak Oracle
Probes model behavior under adversarial conditions
Uncovers failure modes like quant fever
🔎 Similar Papers
No similar papers found.
Shuyi Lin
Shuyi Lin
Northeastern University
System
Tian Lu
Tian Lu
Arizona State University
Human-AI collaborationImpact of AI and big dataFinTechE-commerceSharing economy
Z
Zikai Wang
Northeastern University
B
Bo Wen
Northeastern University
Y
Yibo Zhao
Northeastern University
C
Cheng Tan
Shanghai Jiao Tong University