Quant Fever, Reasoning Blackholes, Schrodinger's Compliance, and More: Probing GPT-OSS-20B

📅 2025-09-28

📈 Citations: 0

✨ Influential: 0

career value

191K/year

🤖 AI Summary

This work systematically evaluates the security vulnerabilities of the open-source large language model GPT-OSS-20B under adversarial conditions. Method: We introduce Jailbreak Oracle, a novel evaluation framework integrating adversarial prompt engineering with fine-grained behavioral probing to rigorously assess model robustness. Contribution/Results: We identify and formally characterize five previously unreported failure modes—Quantization Frenzy, Reasoning Black Hole, Schrödinger Compliance, Reasoning Mirage, and Chain-Directed Prompt Leakage. Empirical analysis reveals structural deficiencies in logical reasoning, value alignment, and instruction following. Multiple attack vectors consistently bypass safety guardrails, inducing severe jailbreaking and misleading outputs. Our findings establish a new analytical paradigm for LLM security mechanistic studies and provide a reproducible, open diagnostic toolkit for rigorous safety assessment.

Technology Category

Application Category

📝 Abstract

OpenAI's GPT-OSS family provides open-weight language models with explicit chain-of-thought (CoT) reasoning and a Harmony prompt format. We summarize an extensive security evaluation of GPT-OSS-20B that probes the model's behavior under different adversarial conditions. Us- ing the Jailbreak Oracle (JO) [1], a systematic LLM evaluation tool, the study uncovers several failure modes including quant fever, reasoning blackholes, Schrodinger's compliance, reasoning procedure mirage, and chain-oriented prompting. Experiments demonstrate how these behaviors can be exploited on GPT-OSS-20B models, leading to severe consequences.

Problem

Research questions and friction points this paper is trying to address.

Evaluating GPT-OSS-20B's security vulnerabilities under adversarial attacks

Identifying failure modes like quant fever and reasoning blackholes

Demonstrating exploitability of harmful behaviors in language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic LLM evaluation tool Jailbreak Oracle

Probes model behavior under adversarial conditions

Uncovers failure modes like quant fever

🔎 Similar Papers

No similar papers found.

OpenAI

$207K – $285K • Offers Equity

San Francisco, CA, USA

Authors to Follow