Safety Recovery in Reasoning Models Is Only a Few Early Steering Steps Away

📅 2026-02-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the safety degradation in multimodal large reasoning models (MLRMs) induced by reinforcement-learning post-training, which enhances reasoning capabilities but weakens alignment and increases vulnerability to jailbreak attacks. To mitigate this risk, the authors propose SafeThink, a lightweight inference-time defense that frames safety recovery as a satisficing constraint rather than a maximization objective. Using a safety reward model to monitor the evolving reasoning trace, SafeThink detects hazardous trajectories within the first one to three reasoning steps and injects a short corrective prefix ("Wait, think safely") to steer generation toward safe responses. Evaluated across six open-source MLRMs and four jailbreak benchmarks, SafeThink reduces attack success rates by 30–60% while preserving reasoning performance, evidenced by only a marginal drop in MathVista accuracy from 65.20% to 65.00%.

📝 Abstract
Reinforcement learning (RL) based post-training for explicit chain-of-thought (e.g., GRPO) improves the reasoning ability of multimodal large-scale reasoning models (MLRMs). But recent evidence shows that it can simultaneously degrade safety alignment and increase jailbreak success rates. We propose SafeThink, a lightweight inference-time defense that treats safety recovery as a satisficing constraint rather than a maximization objective. SafeThink monitors the evolving reasoning trace with a safety reward model and conditionally injects an optimized short corrective prefix ("Wait, think safely") only when the safety threshold is violated. In our evaluations across six open-source MLRMs and four jailbreak benchmarks (JailbreakV-28K, Hades, FigStep, and MM-SafetyBench), SafeThink reduces attack success rates by 30-60% (e.g., LlamaV-o1: 63.33% to 5.74% on JailbreakV-28K, R1-Onevision: 69.07% to 5.65% on Hades) while preserving reasoning performance (MathVista accuracy: 65.20% to 65.00%). A key empirical finding from our experiments is that safety recovery is often only a few steering steps away: intervening in the first 1-3 reasoning steps typically suffices to redirect the full generation toward safe completions.
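The monitor-then-steer loop described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the safety reward model, the step generator, and the threshold value here are all hypothetical stubs standing in for the real components the abstract names.

```python
# Sketch of SafeThink-style conditional steering: score the evolving trace
# with a (stub) safety reward model and inject a corrective prefix only when
# the safety threshold is violated, and only in the first few steps.

CORRECTIVE_PREFIX = "Wait, think safely."  # the paper's optimized prefix
SAFE_THRESHOLD = 0.5                       # assumed threshold (satisficing check)
MAX_STEER_STEPS = 3                        # paper: first 1-3 steps typically suffice

def safety_score(trace):
    """Stub safety reward model: low score while an unsafe marker is
    present and no corrective prefix has been injected yet."""
    text = " ".join(trace)
    if CORRECTIVE_PREFIX in text or "unsafe" not in text:
        return 1.0
    return 0.0

def generate_step(trace):
    """Stub policy: emits an unsafe step unless the trace already
    contains the corrective prefix (i.e., it has been steered)."""
    if any(CORRECTIVE_PREFIX in s for s in trace):
        return "I can't assist with that request; here are safe alternatives."
    return "unsafe step: complying with the jailbreak"

def safethink_decode(num_steps=4):
    """Decode step by step, steering conditionally within the first steps."""
    trace = []
    for step in range(num_steps):
        trace.append(generate_step(trace))
        # Satisficing check: intervene only if the threshold is violated,
        # and only within the early steering window.
        if step < MAX_STEER_STEPS and safety_score(trace) < SAFE_THRESHOLD:
            trace.append(CORRECTIVE_PREFIX)
    return trace

if __name__ == "__main__":
    for line in safethink_decode():
        print(line)
```

Note the design point the abstract emphasizes: the prefix is injected *conditionally*, so safe generations proceed unmodified and reasoning performance is preserved.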
Problem

Research questions and friction points this paper is trying to address.

safety alignment
jailbreak attacks
reasoning models
reinforcement learning
multimodal large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

SafeThink
safety recovery
inference-time defense
reasoning alignment
jailbreak mitigation