Bag of Tricks for Subverting Reasoning-based Safety Guardrails

📅 2025-10-13

🤖 AI Summary

This work exposes a systemic vulnerability in the safety mechanisms of Large Reasoning Models (LRMs): current reasoning-based guardrails are brittle against subtle prompt manipulations, particularly template injection, which lets an attacker bypass them and elicit harmful content. To study this flaw systematically, the authors propose a suite of jailbreak methods spanning white-box, gray-box, and black-box settings, integrating prompt template injection, automated attack optimization, and cross-model transfer techniques, and applicable to both locally hosted models and online APIs. On the gpt-oss series, the attacks exceed 90% attack success rate across 5 benchmarks, and evaluations on several other leading open-source LRMs indicate that the vulnerability is systemic. The study documents the failure modes of reasoning-based safety guardrails under template injection, providing empirical evidence and motivation for more robust alignment techniques.

📝 Abstract

Recent reasoning-based safety guardrails for Large Reasoning Models (LRMs), such as deliberative alignment, have shown strong defense against jailbreak attacks. By leveraging LRMs' reasoning ability, these guardrails help the models to assess the safety of user inputs before generating final responses. The powerful reasoning ability can analyze the intention of the input query and will refuse to assist once it detects the harmful intent hidden by the jailbreak methods. Such guardrails have shown a significant boost in defense, such as the near-perfect refusal rates on the open-source gpt-oss series. Unfortunately, we find that these powerful reasoning-based guardrails can be extremely vulnerable to subtle manipulation of the input prompts, and once hijacked, can lead to even more harmful results. Specifically, we first uncover a surprisingly fragile aspect of these guardrails: simply adding a few template tokens to the input prompt can successfully bypass the seemingly powerful guardrails and lead to explicit and harmful responses. To explore further, we introduce a bag of jailbreak methods that subvert the reasoning-based guardrails. Our attacks span white-, gray-, and black-box settings and range from effortless template manipulations to fully automated optimization. Along with the potential for scalable implementation, these methods also achieve alarmingly high attack success rates (e.g., exceeding 90% across 5 different benchmarks on gpt-oss series on both local host models and online API services). Evaluations across various leading open-source LRMs confirm that these vulnerabilities are systemic, underscoring the urgent need for stronger alignment techniques for open-sourced LRMs to prevent malicious misuse. Code is open-sourced at https://chenxshuo.github.io/bag-of-tricks.
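The abstract reports attack success rates per benchmark but does not say here how success is judged. As a minimal sketch, assuming the rate is simply the fraction of prompts in a benchmark whose response a judge marks as a successful attack, the metric could be computed as follows. The function name, the record format, and the sample outcomes are all hypothetical placeholders, not the authors' evaluation code or data.

```python
from collections import defaultdict


def attack_success_rate(records):
    """Return the per-benchmark fraction of attempts marked successful.

    `records` is an iterable of (benchmark_name, succeeded) pairs, where
    `succeeded` is the verdict of some external judge (assumed here).
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for benchmark, succeeded in records:
        totals[benchmark] += 1
        successes[benchmark] += int(bool(succeeded))
    return {name: successes[name] / totals[name] for name in totals}


# Placeholder judge outcomes, invented purely to exercise the function.
demo = [
    ("bench_a", True), ("bench_a", True), ("bench_a", False), ("bench_a", True),
    ("bench_b", True), ("bench_b", True),
]
rates = attack_success_rate(demo)
```

Reporting the rate per benchmark rather than pooled keeps a large benchmark from masking a weak result on a small one, which matters when a claim such as "exceeding 90% across 5 different benchmarks" is meant to hold for each benchmark.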

Problem

Research questions and friction points this paper is trying to address.

Revealing vulnerabilities in the reasoning-based safety guardrails of Large Reasoning Models (LRMs)

Demonstrating how simple prompt manipulations can bypass advanced safety mechanisms

Developing multi-setting jailbreak methods that achieve high attack success rates

Innovation

Methods, ideas, or system contributions that make the work stand out.

Bypassing safety guardrails with template tokens

Introducing multi-level jailbreak methods for subversion

Achieving high attack success rates across diverse LRMs
