SDGO: Self-Discrimination-Guided Optimization for Consistent Safety in Large Language Models

📅 2025-08-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper identifies a critical safety inconsistency in large language models (LLMs) under jailbreaking attacks: their ability to *discriminate* harmful requests significantly exceeds their capacity to *defend* against them during generation. To address this, the authors propose Self-Discrimination-Guided Optimization (SDGO), a self-aligned reinforcement learning framework that requires no external annotations or auxiliary discriminators. SDGO uses the model's own harmfulness judgment of its outputs as the reward signal and performs iterative safety-aware fine-tuning to align its generative and discriminative capabilities. Experiments show that SDGO substantially improves robustness against out-of-distribution jailbreaking attacks while preserving model utility, consistently outperforming safety baselines based on prompt engineering and supervised fine-tuning.

📝 Abstract
Large Language Models (LLMs) excel at various natural language processing tasks but remain vulnerable to jailbreaking attacks that induce harmful content generation. In this paper, we reveal a critical safety inconsistency: LLMs can more effectively identify harmful requests as discriminators than defend against them as generators. This insight inspires us to explore aligning the model's inherent discrimination and generation capabilities. To this end, we propose SDGO (Self-Discrimination-Guided Optimization), a reinforcement learning framework that leverages the model's own discrimination capabilities as a reward signal to enhance generation safety through iterative self-improvement. Our method does not require any additional annotated data or external models during the training phase. Extensive experiments demonstrate that SDGO significantly improves model safety compared to both prompt-based and training-based baselines while maintaining helpfulness on general benchmarks. By aligning LLMs' discrimination and generation capabilities, SDGO brings robust performance against out-of-distribution (OOD) jailbreaking attacks. This alignment achieves tighter coupling between these two capabilities, enabling the model's generation capability to be further enhanced with only a small number of discriminative samples. Our code and datasets are available at https://github.com/NJUNLP/SDGO.
Problem

Research questions and friction points this paper is trying to address.

Addresses safety inconsistency in LLMs' discrimination and generation capabilities
Mitigates jailbreaking attacks that induce harmful content generation
Aligns model's inherent discrimination with generation through self-improvement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-Discrimination-Guided Optimization framework
Reinforcement learning with self-reward signals
Aligns discrimination and generation capabilities
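The self-reward idea in the bullets above can be sketched in a few lines: the model's own harmfulness judgment of a generated response becomes the RL reward, so no external annotator or reward model is needed. This is a minimal toy sketch, not the paper's implementation; the function names (`toy_discriminate`, `sdgo_reward`) and the keyword-based stand-in discriminator are assumptions for illustration, where the real framework would query the LLM itself for the harmfulness judgment.

```python
def toy_discriminate(prompt: str, response: str) -> float:
    """Stand-in for the model judging its own output as a discriminator.

    Returns a harmfulness score in [0, 1]. In SDGO this judgment would
    come from the LLM itself; a keyword check is used here only so the
    sketch runs without a model.
    """
    banned = {"bomb", "exploit", "poison"}
    hits = sum(word in response.lower() for word in banned)
    return min(1.0, hits / 2)


def sdgo_reward(prompt: str, response: str) -> float:
    """Self-discrimination-guided reward: high when the model's own
    judgment deems the response harmless, so safe generations are
    reinforced during iterative RL fine-tuning."""
    return 1.0 - toy_discriminate(prompt, response)


# A harmless answer earns full reward; a harmful one is penalized.
safe = sdgo_reward("how to stay safe", "Wear a helmet when cycling.")
unsafe = sdgo_reward("how to make a bomb", "First, build the bomb with poison...")
```

In the full framework this reward would feed a standard policy-optimization step, closing the loop between the model's discriminative and generative capabilities.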
Peng Ding
National Key Laboratory for Novel Software Technology, Nanjing University
Wen Sun
Meituan Inc., China
Dailin Li
Computer Science and Technology, Dalian University of Technology
Wei Zou
PKU, Samsung, Baidu, Didi, Ke
Speech, NLP, LLM, Multimodal
Jiaming Wang
National University of Singapore
Generative AI, Robotics
Jiajun Chen
National Key Laboratory for Novel Software Technology, Nanjing University
Shujian Huang
School of Computer Science, Nanjing University
Natural Language Processing, Machine Translation, Multilingualism, Large Language Models