SafeTy Reasoning Elicitation Alignment for Multi-Turn Dialogues

πŸ“… 2025-05-31
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To mitigate the security risk of adversarial users covertly eliciting harmful behaviors from large language models (LLMs) in multi-turn dialogues, this paper proposes a plug-and-play safety defense mechanism. Methodologically, we introduce the first safety-reasoning elicitation alignment framework tailored for multi-turn interactions, construct the first high-quality, human-annotated dataset of safety-reasoning multi-turn dialogues, and design a lightweight, non-intrusive moderator that requires no modification to the target LLM. Our technical pipeline integrates supervised fine-tuning, fine-grained safety-intent recognition, and dynamic dialogue-state tracking. Evaluated across diverse mainstream LLMs and representative multi-turn attack strategies, our approach reduces attack success rate (ASR) by 51.2% while preserving the original model’s functional integrity. Key contributions include: (1) the first multi-turn safety-reasoning alignment framework; (2) the first high-quality, human-annotated safety-reasoning dialogue dataset; and (3) the first non-invasive, plug-and-play moderator for multi-turn safety moderation.

Technology Category

Application Category

πŸ“ Abstract
Malicious attackers can exploit large language models (LLMs) by engaging them in multi-turn dialogues to achieve harmful objectives, posing significant safety risks to society. To address this challenge, we propose a novel defense mechanism: SafeTy Reasoning Elicitation Alignment for Multi-Turn Dialogues (STREAM). STREAM defends LLMs against multi-turn attacks while preserving their functional capabilities. Our approach involves constructing a human-annotated dataset, the Safety Reasoning Multi-turn Dialogues dataset, which is used to fine-tune a plug-and-play safety reasoning moderator. This model is designed to identify malicious intent hidden within multi-turn conversations and alert the target LLM of potential risks. We evaluate STREAM across multiple LLMs against prevalent multi-turn attack strategies. Experimental results demonstrate that our method significantly outperforms existing defense techniques, reducing the Attack Success Rate (ASR) by 51.2%, all while maintaining comparable LLM capability.
Problem

Research questions and friction points this paper is trying to address.

Defending LLMs against multi-turn malicious dialogue attacks
Identifying hidden harmful intent in multi-turn conversations
Reducing attack success rate while preserving LLM functionality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Plug-and-play safety reasoning moderator
Human-annotated multi-turn dialogues dataset
Reduces attack success rate significantly
πŸ”Ž Similar Papers
No similar papers found.
Martin Kuo
Martin Kuo
PhD Candidate, Duke University
LLMs Trustworthy AI Generative AI
Jianyi Zhang
Jianyi Zhang
Research Scientist@Google Deepmind, PI@Duke University
LLMsGenerative AITrustworthy AI
Aolin Ding
Aolin Ding
Security Research Scientist, Accenture
L
Louis DiValentin
Accenture, USA
A
Amin Hass
Accenture, USA
Benjamin F Morris
Benjamin F Morris
Duke University
computer architecture
I
Isaac Jacobson
Center for Computational Evolutionary Intelligence, Duke University
Randolph Linderman
Randolph Linderman
Ph.D. Student, Duke University
ML SafetyOut-of-distribution detectionBayesian non-parametrics
J
James Kiessling
Center for Computational Evolutionary Intelligence, Duke University
N
Nicolas Ramos
Center for Computational Evolutionary Intelligence, Duke University
Bhavna Gopal
Bhavna Gopal
PhD student @ Duke University
Computer VisionNeural Architecture SearchAI Safety and PrivacyAdversarial Robustness
M
M. Pouyan
Accenture, USA
C
Changwei Liu
Accenture, USA
H
Hai Li
Center for Computational Evolutionary Intelligence, Duke University
Y
Yiran Chen
Center for Computational Evolutionary Intelligence, Duke University