AI Summary
This work addresses the vulnerability of large language models to multi-step indirect jailbreak attacks, which current surface-level alignment mechanisms struggle to mitigate. The authors propose Reflector, a framework that internalizes trajectory-level self-reflection directly into the model's generation process. It first applies teacher-guided supervised fine-tuning on high-quality reflection data, then reinforcement learning with outcome-driven and reward-validity supervision, enabling the model to autonomously detect and block indirect jailbreak attempts. Without incurring significant computational overhead, Reflector attains a defense success rate above 90% against sophisticated indirect attacks. It also improves performance on GSM8K by 5.85% and generalizes better on knowledge-intensive tasks.
Abstract
While Large Language Models (LLMs) demonstrate remarkable capabilities, they remain susceptible to sophisticated, multi-step jailbreak attacks that circumvent conventional surface-level safety alignment by exploiting the internal generation process. To address these vulnerabilities, we propose Reflector, a principled two-stage framework that internalizes self-reflection within the generation trajectory. Reflector first leverages teacher-guided generation to produce high-quality reflection data for supervised fine-tuning (SFT), establishing structured reflection patterns. It subsequently uses Reinforcement Learning (RL) with outcome-driven and reward-validity supervision to instill robust, autonomous self-reflection capabilities. Empirical results show that Reflector achieves Defense Success Rates (DSR) exceeding 90% against complex indirect attacks while generalizing robustly across diverse threat scenarios. Notably, the framework enhances both task-specific and general utility, yielding a 5.85% gain on GSM8K alongside improved performance on knowledge-intensive benchmarks. By internalizing trajectory-level safety, Reflector overcomes the fundamental limitations of surface alignment without significant computational overhead, offering an efficient and scalable solution for the development of safe and capable LLMs.
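The abstract reports Defense Success Rate (DSR) without defining it. As a rough illustration only, the sketch below assumes DSR is the percentage of attack attempts that the model successfully blocks; the function name, the boolean outcome format, and the example data are hypothetical and not taken from the paper.

```python
def defense_success_rate(defended):
    """Return the percentage of attack attempts that were blocked.

    `defended` is an iterable of booleans, one per attack attempt:
    True if the model blocked the attempt, False if the attack succeeded.
    This definition of DSR is an assumption, not the paper's stated formula.
    """
    outcomes = list(defended)
    if not outcomes:
        raise ValueError("need at least one attack attempt")
    blocked = sum(1 for d in outcomes if d)
    return 100.0 * blocked / len(outcomes)


# Hypothetical outcomes: 19 of 20 attack attempts blocked.
example_outcomes = [True] * 19 + [False]
example_dsr = defense_success_rate(example_outcomes)
```

Under this assumed definition, a DSR above 90% means fewer than one in ten attack attempts gets through.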