REFLECTOR: Internalizing Step-wise Reflection against Indirect Jailbreak

📅 2026-05-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the vulnerability of large language models to multi-step indirect jailbreak attacks, which surface-level alignment mechanisms struggle to mitigate. The authors propose Reflector, a two-stage framework that internalizes trajectory-level self-reflection directly into the model’s generation process. It first applies supervised fine-tuning on high-quality reflection data produced by teacher-guided generation, then applies reinforcement learning with outcome-driven and reward-validity supervision, so that the model learns to detect and block indirect jailbreak attempts autonomously. Without significant computational overhead, Reflector attains a Defense Success Rate (DSR) above 90% against sophisticated indirect attacks. It also yields a 5.85% gain on GSM8K and improves performance on knowledge-intensive benchmarks.

📝 Abstract

While Large Language Models (LLMs) demonstrate remarkable capabilities, they remain susceptible to sophisticated, multi-step jailbreak attacks that circumvent conventional surface-level safety alignment by exploiting the internal generation process. To address these vulnerabilities, we propose Reflector, a principled two-stage framework that internalizes self-reflection within the generation trajectory. Reflector first leverages teacher-guided generation to produce high-quality reflection data for supervised fine-tuning (SFT), establishing structured reflection patterns. It subsequently uses Reinforcement Learning (RL) with outcome-driven and reward-validity supervision to instill robust, autonomous self-reflection capabilities. Empirical results show that Reflector achieves Defense Success Rates (DSR) exceeding 90% against complex indirect attacks while generalizing robustly across diverse threat scenarios. Notably, the framework enhances both task-specific and general utility, yielding a 5.85% gain on GSM8K alongside improved performance on knowledge-intensive benchmarks. By internalizing trajectory-level safety, Reflector overcomes the fundamental limitations of surface alignment without significant computational overhead, offering an efficient and scalable solution for the development of safe and capable LLMs.
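The abstract reports Defense Success Rate (DSR) without defining it here. A minimal sketch, assuming DSR is the percentage of attack attempts that the model successfully blocks: the `AttackTrial` record, its `defended` flag (standing in for whatever judge the paper uses), and the sample data are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class AttackTrial:
    """One jailbreak attempt and its verdict (hypothetical record layout)."""
    prompt_id: str
    defended: bool  # assumed judge verdict: True if the attack was blocked


def defense_success_rate(trials: list[AttackTrial]) -> float:
    """Return the share of blocked attacks as a percentage.

    Assumes DSR = 100 * (blocked attacks) / (total attacks); the paper's
    exact definition may differ.
    """
    if not trials:
        raise ValueError("at least one trial is required")
    blocked = sum(1 for t in trials if t.defended)
    return 100.0 * blocked / len(trials)


# Illustrative data only, not results from the paper.
trials = [
    AttackTrial("attack-1", True),
    AttackTrial("attack-2", True),
    AttackTrial("attack-3", False),
    AttackTrial("attack-4", True),
]
dsr = defense_success_rate(trials)
print(f"DSR: {dsr:.1f}%")
```

Under this assumed definition, a reported DSR above 90% would mean that fewer than one in ten attack attempts in the evaluation set got through.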

Problem

Research questions and friction points this paper is trying to address.

jailbreak attacks

Large Language Models

safety alignment

indirect attacks

internal generation process

Innovation

Methods, ideas, or system contributions that make the work stand out.

step-wise reflection

internalized self-reflection

indirect jailbreak defense