Breaking the Safety-Capability Tradeoff: Reinforcement Learning with Verifiable Rewards Maintains Safety Guardrails in LLMs

📅 2025-11-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the pervasive safety-capability trade-off in large language model (LLM) fine-tuning, where improving downstream task performance often degrades safety alignment, through Reinforcement Learning with Verifiable Rewards (RLVR). We present the first theoretical analysis of RLVR's safety properties, deriving upper bounds on safety drift under KL-divergence constraints and proving conditions under which safety degradation is eliminated. Empirically, we validate RLVR across multiple benchmarks, including five adversarial safety evaluations, demonstrating substantial gains in reasoning capability while maintaining or even improving safety alignment. Compared with supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), RLVR combines verifiability, safety preservation, and strong task performance, overcoming the inverse relationship between safety and capability seen in conventional fine-tuning methods.

📝 Abstract
Fine-tuning large language models (LLMs) for downstream tasks typically exhibits a fundamental safety-capability tradeoff, where improving task performance degrades safety alignment even on benign datasets. This degradation persists across standard approaches, including supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). While reinforcement learning with verifiable rewards (RLVR) has emerged as a promising alternative that optimizes models on objectively measurable tasks, its safety implications remain unexplored. We present the first comprehensive theoretical and empirical analysis of safety properties in RLVR. Theoretically, we derive upper bounds on safety drift under KL-constrained optimization and prove conditions under which safety degradation is eliminated. Empirically, we conduct extensive experiments across five adversarial safety benchmarks, demonstrating that RLVR can enhance reasoning capabilities while maintaining or improving safety guardrails. Our comprehensive ablation studies examine the effects of optimization algorithms, model scale, and task domains. Our findings challenge the prevailing assumption of an inevitable safety-capability tradeoff and establish that a specific training methodology can achieve both objectives simultaneously, providing insights for the safe deployment of reasoning-capable LLMs.
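
To make the training setup concrete, the sketch below illustrates the generic RLVR reward the abstract describes: a binary reward from an objective verifier combined with a per-token KL penalty against a frozen reference policy. This is a minimal illustration assuming the standard KL-regularized formulation, not the authors' implementation; the `#### <answer>` extraction convention, the `beta` coefficient, and all function names are illustrative assumptions.

```python
# Minimal sketch of an RLVR-style reward with a KL penalty (illustrative, not the paper's code).
import re
from typing import List


def verifiable_reward(completion: str, gold_answer: str) -> float:
    """Binary reward from an objective check, e.g. exact match on a final numeric answer."""
    match = re.search(r"####\s*(.+)", completion)  # assumes a '#### <answer>' convention
    predicted = match.group(1).strip() if match else ""
    return 1.0 if predicted == gold_answer.strip() else 0.0


def kl_penalized_reward(
    completion: str,
    gold_answer: str,
    policy_logprobs: List[float],     # per-token log-probs under the current policy
    reference_logprobs: List[float],  # per-token log-probs under the frozen reference model
    beta: float = 0.05,               # KL coefficient; the value here is an assumption
) -> float:
    """Verifiable task reward minus a sampled KL estimate (standard KL-regularized RL reward)."""
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))
    return verifiable_reward(completion, gold_answer) - beta * kl_estimate


if __name__ == "__main__":
    # Toy usage: a correct completion whose token log-probs drift slightly from the reference.
    completion = "2 + 2 = 4, so the answer is #### 4"
    print(kl_penalized_reward(completion, "4",
                              policy_logprobs=[-0.9, -1.1, -0.8],
                              reference_logprobs=[-1.0, -1.0, -1.0]))
```

In a full RLVR pipeline this shaped reward would feed a policy-gradient update (e.g. PPO or GRPO); only the verifier and the penalty term are sketched here.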
Problem

Research questions and friction points this paper is trying to address.

Addresses safety degradation in LLMs during fine-tuning
Explores reinforcement learning with verifiable rewards for safety
Challenges the assumption of an inevitable safety-capability tradeoff
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning with verifiable rewards (RLVR) optimizes models on objectively checkable tasks
Theoretical analysis derives upper bounds on safety drift under KL-constrained optimization (the standard objective behind such bounds is sketched after this list)
Empirical experiments show safety is maintained or improved while reasoning capabilities are enhanced
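
For reference, KL-constrained analyses of this kind typically start from the standard KL-regularized objective below; the paper's specific drift bound is not reproduced here, and the notation (r_ver for the verifiable reward, pi_ref for the frozen reference policy, beta for the KL coefficient) is an assumption.

```latex
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
\!\left[ r_{\mathrm{ver}}(x, y) \right]
\;-\;
\beta\, \mathbb{E}_{x \sim \mathcal{D}}
\!\left[ \mathrm{KL}\!\left( \pi_\theta(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right) \right]
```

Bounding how far the optimum of such an objective can move from \pi_{\mathrm{ref}} in KL is the mechanism by which safety drift can be controlled.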
👥 Authors
Dongkyu Derek Cho
Department of Statistical Science, Duke University
Huan Song
Amazon AWS AI
Deep learning, machine learning, graph neural networks, time-series analysis
Arijit Ghosh Chowdhury
Amazon Web Services
NLP, Artificial Intelligence, Data Science
Haotian An
AWS Generative AI Innovation Center
Yawei Wang
AWS Generative AI Innovation Center
Rohit Thekkanal
AWS Generative AI Innovation Center
Negin Sokhandan
Principal Applied Scientist at AWS; University of Calgary; Georgia Institute of Technology
AI, Statistical Inference, Computer Vision, Reinforcement Learning, Signal Processing
Sharlina Keshava
AWS Generative AI Innovation Center
Hannah Marlowe
AWS Generative AI Innovation Center