AI Summary
This work addresses the vulnerability of large reasoning models to adversarial attacks, particularly their inability to recover from unsafe reasoning trajectories. To tackle this, the authors propose Self-ReSET, a novel framework that leverages the model's own unsafe intermediate states as initial states for reinforcement learning, thereby internalizing self-recovery capabilities without relying on external expert demonstrations or adversarial prefix data. By eschewing static expert datasets, Self-ReSET effectively explores a broad generative space and induces stable self-recovery behaviors. Experiments demonstrate that the method significantly enhances robustness against out-of-distribution jailbreaking prompts across multiple large reasoning models and benchmarks, while preserving general performance.
Abstract
Large Reasoning Models (LRMs) possess remarkable capabilities for self-correction in general domains; however, they frequently struggle to recover from unsafe reasoning trajectories under adversarial attacks. Existing alignment methods attempt to mitigate this vulnerability by fine-tuning the model on expert data that includes reflection traces or adversarial prefixes. Crucially, these approaches are often hindered by static training data, which inevitably deviates from the model's dynamic, on-policy reasoning traces. As a result, the model hardly covers its vast generation space and hardly learns to recover from its own failures. To bridge this gap, we propose Self-ReSET, a pure reinforcement learning framework designed to equip LRMs with the intrinsic capacity to recover from their own safety error trajectories, which are subsequently reused as initial states for reinforcement learning. Extensive experiments across various LRMs and benchmarks demonstrate that Self-ReSET significantly enhances robustness against adversarial attacks, especially out-of-distribution (OOD) jailbreak prompts, while maintaining general utility and using data efficiently. Further analysis reveals that our method effectively fosters self-recovery patterns, enabling models to better identify unsafe intermediate error states and recover from them back to benign paths. Our code and data are available at https://github.com/Ing1024/Self-ReSET.
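The idea of reusing the model's own unsafe trajectories as initial states can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how unsafe states are detected, where trajectories are cut, or how rewards are defined. The names `ResetState`, `collect_reset_states`, `recovery_reward`, `rollout_from_resets`, `toy_policy`, and `is_unsafe` are all hypothetical. The sketch also assumes a trajectory is cut just after its first unsafe step and that a binary reward is given for a continuation with no unsafe step. The policy update itself is omitted.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Step = str
# Hypothetical policy interface: (prompt, reasoning prefix) -> continuation steps.
Policy = Callable[[str, List[Step]], List[Step]]


@dataclass
class ResetState:
    prompt: str
    prefix: List[Step]  # the model's own reasoning up to its first unsafe step


def first_unsafe_index(steps: List[Step], is_unsafe: Callable[[Step], bool]) -> int:
    """Return the index of the first unsafe step, or -1 if there is none."""
    for i, step in enumerate(steps):
        if is_unsafe(step):
            return i
    return -1


def collect_reset_states(prompts, policy: Policy, is_unsafe) -> List[ResetState]:
    """Roll out the policy from scratch and keep the prefixes of unsafe trajectories."""
    states = []
    for prompt in prompts:
        trajectory = policy(prompt, [])
        idx = first_unsafe_index(trajectory, is_unsafe)
        if idx >= 0:
            # Assumption: cut right after the first unsafe step (inclusive).
            states.append(ResetState(prompt, trajectory[: idx + 1]))
    return states


def recovery_reward(continuation: List[Step], is_unsafe) -> float:
    """Assumed binary reward: 1.0 if the continuation contains no unsafe step."""
    return 0.0 if any(is_unsafe(s) for s in continuation) else 1.0


def rollout_from_resets(
    states: List[ResetState], policy: Policy, is_unsafe
) -> List[Tuple[ResetState, List[Step], float]]:
    """Continue from each unsafe prefix and score the continuation.

    The returned tuples are what an RL update would consume; the update is omitted.
    """
    batch = []
    for state in states:
        continuation = policy(state.prompt, state.prefix)
        batch.append((state, continuation, recovery_reward(continuation, is_unsafe)))
    return batch


# Toy stand-ins for a real model and safety judge (both hypothetical).
def is_unsafe(step: Step) -> bool:
    return step.startswith("UNSAFE")


def toy_policy(prompt: str, prefix: List[Step]) -> List[Step]:
    if not prefix:
        if "attack" in prompt:
            return ["plan", "UNSAFE: comply", "UNSAFE: details"]
        return ["plan", "answer"]
    # When started from an unsafe prefix, this toy policy reflects and refuses.
    return ["reflect: this is harmful", "refuse"]


states = collect_reset_states(["attack prompt", "benign prompt"], toy_policy, is_unsafe)
batch = rollout_from_resets(states, toy_policy, is_unsafe)
```

In this toy run, only the adversarial prompt yields a reset state, and the continuation from that state is scored as a successful recovery. Because the reset states come from the policy's own rollouts rather than a fixed dataset, they would shift as the policy changes, which is the on-policy property the abstract contrasts with static expert data.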