AlphaAlign: Incentivizing Safety Alignment with Extremely Simplified Reinforcement Learning

📅 2025-07-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing safety alignment methods suffer from excessive refusal, degraded task performance, and reliance on dense supervised data or superficial refusal heuristics—failing to elicit models’ intrinsic safety self-awareness. Method: We propose AlphaAlign, a minimalist pure reinforcement learning framework that employs a dual-reward mechanism: verifiable binary safety rewards and normalized utility rewards—requiring no supervised safety reasoning data. Its core innovation lies in decoupling safety and utility objectives to achieve deep alignment. Contribution/Results: Experiments demonstrate that AlphaAlign significantly improves harmful content detection, drastically reduces over-refusal rates, maintains or enhances general-task performance, and exhibits strong robustness against unseen jailbreaking attacks—thereby enabling proactive, self-driven safety reasoning in large language models.

📝 Abstract
Large language models (LLMs), despite possessing latent safety understanding from their vast pretraining data, remain vulnerable to generating harmful content and exhibit issues such as over-refusal and utility degradation after safety alignment. Current safety alignment methods often result in superficial refusal shortcuts or rely on intensive supervision for reasoning-based approaches, failing to fully leverage the model's intrinsic safety self-awareness. We propose AlphaAlign, a simple yet effective pure reinforcement learning (RL) framework with a verifiable safety reward, designed to incentivize this latent safety awareness through proactive safety reasoning. AlphaAlign employs a dual-reward system: a verifiable safety reward encourages correctly formatted and explicitly justified refusals of harmful queries while penalizing over-refusals, and a normalized helpfulness reward guides high-quality responses to benign inputs. This allows the model to develop proactive safety reasoning capabilities without depending on supervised safety-specific reasoning data. AlphaAlign demonstrates three key advantages: (1) Simplicity and efficiency, requiring only binary prompt safety labels and minimal RL steps for substantial improvements. (2) Breaking the safety-utility trade-off, by enhancing refusal of harmful content and reducing over-refusals, while simultaneously maintaining or even improving general task performance and robustness to unseen jailbreaks. (3) Deep alignment, fostering proactive safety reasoning that generates explicit safety rationales rather than relying on shallow refusal patterns.
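The dual-reward system described above can be sketched in a few lines. This is a hypothetical illustration inferred from the abstract, not the authors' implementation: the function names, signatures, and reward magnitudes are assumptions, and only the overall structure (a verifiable binary safety reward conditioned on the prompt's binary safety label, plus a group-normalized helpfulness reward for benign prompts) follows the paper's description.

```python
# Hypothetical sketch of AlphaAlign's dual-reward mechanism, based only on
# the abstract. Names and reward values are assumptions for illustration.

def safety_reward(prompt_is_harmful: bool, refused: bool,
                  has_safety_rationale: bool) -> float:
    """Verifiable binary safety reward.

    Harmful prompt: reward a refusal that includes an explicit safety
    rationale; penalize compliance or an unjustified refusal.
    Benign prompt: penalize refusal (over-refusal), stay neutral otherwise
    so the helpfulness reward drives quality.
    """
    if prompt_is_harmful:
        return 1.0 if (refused and has_safety_rationale) else -1.0
    return -1.0 if refused else 0.0

def normalized_helpfulness_reward(raw_scores: list[float]) -> list[float]:
    """Normalize utility scores within a sampled group of responses
    (mean-zero, unit-ish variance), so benign responses are rewarded
    relative to each other rather than on an absolute scale."""
    mean = sum(raw_scores) / len(raw_scores)
    var = sum((s - mean) ** 2 for s in raw_scores) / len(raw_scores)
    std = var ** 0.5
    return [(s - mean) / (std + 1e-8) for s in raw_scores]
```

Because the safety reward needs only the prompt's binary harmful/benign label and checkable response properties (refusal, rationale, format), it is verifiable without a learned reward model, which is what lets the framework avoid supervised safety reasoning data.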
Problem

Research questions and friction points this paper is trying to address.

Addresses LLMs' harmful content and over-refusal issues
Leverages intrinsic safety awareness via simplified RL framework
Balances safety-utility trade-off with proactive reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pure RL framework with verifiable safety reward
Dual-reward system for safety and helpfulness
Minimal supervision, proactive safety reasoning