Safety Instincts: LLMs Learn to Trust Their Internal Compass for Self-Defense

📅 2025-10-01

🤖 AI Summary

LLM safety alignment lacks universal standards and reliable content validators, which makes effective training signals hard to obtain. This work proposes Safety Instincts Reinforcement Learning (SIRL), a reinforcement learning framework that uses the model's own confidence as its reward. The authors observe that aligned models already show an entropy gap: refusals to harmful requests are generated with high confidence (low entropy), whereas potentially dangerous content is generated with high entropy. SIRL turns this gap into a self-generated reward that reinforces low-entropy refusals, with no human annotations or external validators. Using only 15,000 unlabeled prompts, SIRL maintains Defense Success Rates of 89% or higher against more than 20 jailbreak methods on Llama and Qwen models, surpassing resource-intensive supervised methods while preserving performance on mathematics, coding, and conversation benchmarks. The core contribution is showing that generation entropy can serve as an intrinsic, reinforcement-compatible safety signal.

📝 Abstract

Ensuring Large Language Model (LLM) safety remains challenging due to the absence of universal standards and reliable content validators, making it difficult to obtain effective training signals. We discover that aligned models already possess robust internal safety beliefs: they consistently produce high-confidence refusals to harmful requests while exhibiting high entropy when generating potentially dangerous content. This entropy gap reveals an untapped signal: models intrinsically "know" when to refuse. We introduce Safety Instincts Reinforcement Learning (SIRL), which transforms this internal confidence into a self-generated reward signal, eliminating dependence on external validators or human annotations. SIRL teaches models to trust their safety instincts by reinforcing low-entropy refusal behaviors. Evaluated on Llama and Qwen models, SIRL maintains 89%+ Defense Success Rates (DSRs) against 20+ jailbreak methods, from static prompts to adaptive attacks. Using only 15,000 unlabeled prompts, SIRL surpasses resource-intensive supervised methods while preserving performance on mathematics, coding, and conversation benchmarks. Our work demonstrates that effective alignment can emerge from within, paving the way for more autonomous and robust AI safety mechanisms that scale without extensive human oversight.
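To make the entropy gap concrete, the sketch below scores a response by its mean per-token Shannon entropy and uses the negative of that value as a reward, so that more confident generations score higher. This is an illustration only, not the paper's implementation: the function names, the choice of a simple mean over tokens, and the toy logits are all assumptions, since the abstract does not specify how the reward is computed.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_entropy(logits):
    """Shannon entropy (in nats) of one next-token distribution."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_response_entropy(per_token_logits):
    """Average token entropy over a whole response (assumed aggregation)."""
    if not per_token_logits:
        raise ValueError("response must contain at least one token")
    total = sum(token_entropy(step) for step in per_token_logits)
    return total / len(per_token_logits)


def entropy_reward(per_token_logits):
    """Hypothetical reward: lower entropy (higher confidence) scores higher."""
    return -mean_response_entropy(per_token_logits)


# Toy logits over a 4-token vocabulary, three generation steps each.
confident_response = [[8.0, 0.0, 0.0, 0.0]] * 3   # sharply peaked
uncertain_response = [[0.0, 0.0, 0.0, 0.0]] * 3   # uniform

print(entropy_reward(confident_response) > entropy_reward(uncertain_response))
```

In the toy data, the uniform logits reach the maximum possible entropy for four tokens (ln 4), while the peaked logits stay close to zero, so the confident response receives the larger reward.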

Problem

Research questions and friction points this paper is trying to address.

Developing autonomous safety mechanisms for LLMs without external validators

Leveraging internal confidence signals to reinforce refusal behaviors

Maintaining high defense rates against jailbreak attacks while preserving performance
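The defense rate mentioned above is reported in the abstract as a Defense Success Rate (DSR). A minimal sketch of how such a rate could be tallied is shown here, assuming DSR is the fraction of attack attempts that the model successfully defends; the abstract does not give the exact definition or the judging procedure, so the function name and the boolean outcome format are hypothetical.

```python
def defense_success_rate(outcomes):
    """Fraction of attack attempts defended.

    `outcomes` is a list of booleans, True when the model withstood the
    attack. This definition is an assumption, not taken from the paper.
    """
    if not outcomes:
        raise ValueError("need at least one attack outcome")
    defended = sum(1 for ok in outcomes if ok)
    return defended / len(outcomes)


# Toy example: 9 of 10 jailbreak attempts are defended.
toy_outcomes = [True] * 9 + [False]
toy_dsr = defense_success_rate(toy_outcomes)
print(f"{toy_dsr:.0%}")
```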

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses internal confidence as a self-generated reward signal

Reinforces low-entropy refusal behaviors for safety

Achieves high defense rates without external validators
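One way a self-generated reward could drive reinforcement without an external validator is to sample several responses per prompt and compare their rewards against each other. The sketch below normalizes rewards within such a group to obtain per-response advantages. The group-normalization scheme and all names here are assumptions chosen for illustration; the summary does not state which policy-optimization algorithm SIRL uses.

```python
import statistics


def group_advantages(rewards):
    """Standardize rewards within one prompt's group of sampled responses.

    Responses with above-average reward (lower entropy) get a positive
    advantage and would be reinforced; the rest get a negative one.
    """
    if not rewards:
        raise ValueError("need at least one reward")
    mean = statistics.fmean(rewards)
    spread = statistics.pstdev(rewards)
    if spread < 1e-12:
        # All responses look equally confident: no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / spread for r in rewards]


# Hypothetical negative-entropy rewards for three sampled responses.
toy_rewards = [-0.1, -1.2, -0.6]
toy_adv = group_advantages(toy_rewards)
print(toy_adv)
```

In this toy group the first response has the lowest entropy, so it receives the largest advantage. Because the comparison is relative within the group, no absolute entropy threshold or labeled data is needed.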
