SafeTy Reasoning Elicitation Alignment for Multi-Turn Dialogues

📅 2025-05-31

🤖 AI Summary
To mitigate the security risk of adversarial users covertly eliciting harmful behaviors from large language models (LLMs) in multi-turn dialogues, this paper proposes a plug-and-play safety defense mechanism. Methodologically, we introduce the first safety-reasoning elicitation alignment framework tailored for multi-turn interactions, construct the first high-quality, human-annotated dataset of safety-reasoning multi-turn dialogues, and design a lightweight, non-intrusive moderator that requires no modification to the target LLM. Our technical pipeline integrates supervised fine-tuning, fine-grained safety-intent recognition, and dynamic dialogue-state tracking. Evaluated across diverse mainstream LLMs and representative multi-turn attack strategies, our approach reduces attack success rate (ASR) by 51.2% while preserving the original model's functional integrity. Key contributions include: (1) the first multi-turn safety-reasoning alignment framework; (2) the first high-quality, human-annotated safety-reasoning dialogue dataset; and (3) the first non-invasive, plug-and-play moderator for multi-turn safety moderation.

📝 Abstract
Malicious attackers can exploit large language models (LLMs) by engaging them in multi-turn dialogues to achieve harmful objectives, posing significant safety risks to society. To address this challenge, we propose a novel defense mechanism: SafeTy Reasoning Elicitation Alignment for Multi-Turn Dialogues (STREAM). STREAM defends LLMs against multi-turn attacks while preserving their functional capabilities. Our approach involves constructing a human-annotated dataset, the Safety Reasoning Multi-turn Dialogues dataset, which is used to fine-tune a plug-and-play safety reasoning moderator. This model is designed to identify malicious intent hidden within multi-turn conversations and alert the target LLM of potential risks. We evaluate STREAM across multiple LLMs against prevalent multi-turn attack strategies. Experimental results demonstrate that our method significantly outperforms existing defense techniques, reducing the Attack Success Rate (ASR) by 51.2%, all while maintaining comparable LLM capability.
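The plug-and-play idea in the abstract can be illustrated with a minimal sketch: a separate moderator reads the whole dialogue, and when it suspects hidden malicious intent it passes a warning to an unmodified target LLM. Everything in the block is an assumption made for illustration (`Turn`, `Verdict`, `keyword_moderator`, `guarded_reply`, `echo_llm`); the keyword check merely stands in for the fine-tuned safety reasoning moderator and is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Turn:
    role: str      # "user", "assistant", or "system"
    content: str


@dataclass
class Verdict:
    risky: bool
    reasoning: str  # moderator's explanation of the suspected intent


def keyword_moderator(history: List[Turn]) -> Verdict:
    """Toy stand-in for a fine-tuned moderator (hypothetical, not STREAM's model).

    It looks at all user turns together, so intent that is spread over
    several turns is judged on the whole conversation.
    """
    user_text = " ".join(t.content.lower() for t in history if t.role == "user")
    flagged = [w for w in ("explosive", "detonator") if w in user_text]
    if len(flagged) >= 2:
        return Verdict(True, "User turns jointly mention: " + ", ".join(flagged))
    return Verdict(False, "No harmful intent detected across turns.")


def guarded_reply(
    history: List[Turn],
    target_llm: Callable[[List[Turn]], str],
    moderator: Callable[[List[Turn]], Verdict],
) -> str:
    """Call the target LLM, prepending a warning when the moderator flags risk."""
    verdict = moderator(history)
    messages = list(history)  # copy, so the caller's history is left untouched
    if verdict.risky:
        messages.insert(0, Turn("system", "Safety warning: " + verdict.reasoning))
    return target_llm(messages)


def echo_llm(messages: List[Turn]) -> str:
    """Dummy target model: refuses when it sees a safety warning."""
    if any(m.role == "system" and m.content.startswith("Safety warning") for m in messages):
        return "I can't help with that."
    return "Sure: " + messages[-1].content
```

The target model is only ever called through `guarded_reply`, so swapping in a different LLM or a different moderator requires no change to either component, which is the sense of "plug-and-play" assumed here.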

Problem

Research questions and friction points this paper is trying to address.

Defending LLMs against multi-turn malicious dialogue attacks
Identifying hidden harmful intent in multi-turn conversations
Reducing attack success rate while preserving LLM functionality
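
Attack success rate is commonly computed as the share of attack attempts that elicit a harmful response. The sketch below assumes that definition and uses made-up outcomes; the function name `attack_success_rate` is hypothetical. This page does not say whether the reported 51.2% is an absolute or a relative reduction, so the example simply shows an absolute drop in percentage points.

```python
from typing import Sequence


def attack_success_rate(outcomes: Sequence[bool]) -> float:
    """Return ASR in percent; each entry is True if that attack succeeded."""
    if not outcomes:
        raise ValueError("need at least one attack attempt")
    return 100.0 * sum(outcomes) / len(outcomes)


# Illustrative outcomes only, not results from the paper.
undefended = [True, True, True, False]
defended = [True, False, False, False]

asr_before = attack_success_rate(undefended)
asr_after = attack_success_rate(defended)
absolute_drop = asr_before - asr_after  # in percentage points
```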

Innovation

Methods, ideas, or system contributions that make the work stand out.

Plug-and-play safety reasoning moderator
Human-annotated multi-turn dialogues dataset
Reduces attack success rate (ASR) by 51.2%

Authors

Martin Kuo
PhD Candidate, Duke University
LLMs, Trustworthy AI, Generative AI

Jianyi Zhang
Research Scientist @ Google DeepMind, PI @ Duke University
LLMs, Generative AI, Trustworthy AI

Aolin Ding
Security Research Scientist, Accenture

Louis DiValentin
Accenture, USA

Amin Hass
Accenture, USA

Benjamin F Morris
Duke University
Computer Architecture

Isaac Jacobson
Center for Computational Evolutionary Intelligence, Duke University

Randolph Linderman
Ph.D. Student, Duke University
ML Safety, Out-of-distribution detection, Bayesian non-parametrics

James Kiessling
Center for Computational Evolutionary Intelligence, Duke University

Nicolas Ramos
Center for Computational Evolutionary Intelligence, Duke University

Bhavna Gopal
PhD student @ Duke University
Computer Vision, Neural Architecture Search, AI Safety and Privacy, Adversarial Robustness

M. Pouyan
Accenture, USA

Changwei Liu
Accenture, USA

Hai Li
Center for Computational Evolutionary Intelligence, Duke University

Yiran Chen
Center for Computational Evolutionary Intelligence, Duke University