Beware of Your Po! Measuring and Mitigating AI Safety Risks in Role-Play Fine-Tuning of LLMs

📅 2025-02-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the safety degradation of large language models (LLMs) induced by role-play fine-tuning, providing the first systematic quantification of the associated safety risks, particularly for adversarial or "villainous" role scenarios. The authors propose SaRFT, a safety-aware role-play fine-tuning framework that jointly optimizes role fidelity and safety through integrated safety constraints and role-consistency regularization, compatible with both LoRA and full-parameter fine-tuning. Evaluated across 95 role-specific models built on RoleBench, SaRFT shows consistent improvements on mainstream base models, including LLaMA-3, Gemma-2, and Qwen2.5, achieving an average 23.6% gain in safety scores while preserving role-playing capability. These results support the effectiveness and generalizability of role-adaptive safety mechanisms in mitigating fine-tuning-induced vulnerabilities.
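To make the "jointly optimizes role fidelity and safety" description concrete, here is a minimal sketch of what such a combined objective could look like. Everything here is an assumption for illustration: the function name, the weighting factor `lam`, and the two data batches are hypothetical, and the paper's actual safety constraints and role-consistency regularization may be formulated differently. The sketch follows the Hugging Face `transformers` convention that a causal-LM forward pass with `labels` returns a `.loss`:

```python
def joint_role_safety_loss(model, role_batch, safety_batch, lam=0.5):
    """Hypothetical SaRFT-style objective (illustrative, not the
    authors' released implementation): next-token loss on in-character
    role-play data plus a weighted safety term on refusal data.
    `lam` trades off safety against role fidelity (value assumed)."""
    # Role-fidelity term: standard causal-LM loss on role-play responses.
    # Each batch holds input_ids, attention_mask, and labels.
    role_loss = model(**role_batch).loss
    # Safety term: causal-LM loss on safe refusals to harmful prompts.
    safety_loss = model(**safety_batch).loss
    return role_loss + lam * safety_loss
```

In practice, both terms would be computed at each training step and the combined loss backpropagated through either the LoRA adapters or all model parameters, matching the two fine-tuning settings the paper evaluates.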

📝 Abstract
Role-playing enables large language models (LLMs) to engage users in immersive and personalized interactions, but it also introduces significant safety risks. Existing role-play fine-tuning techniques improve role adaptability but may degrade safety performance, particularly for villainous characters. In this work, we conduct the first comprehensive assessment of role-play fine-tuning risks by training 95 role-specific LLMs using RoleBench. Our experiments reveal that role-play fine-tuning leads to a noticeable decline in safety performance, with safety risks varying based on character traits. To tackle this challenge, we propose Safety-Aware Role-Play Fine-Tuning (SaRFT), a novel method designed to balance role-playing capabilities and safety. Extensive experiments on LLaMA-3-8B-Instruct, Gemma-2-9B-it, and Qwen2.5-7B-Instruct demonstrate that SaRFT consistently outperforms state-of-the-art baselines under both LoRA and full-parameter fine-tuning settings. Our findings highlight the necessity of role-adaptive safety measures and provide insights into mitigating role-specific safety risks in role-playing LLMs.
Problem

Research questions and friction points this paper is trying to address.

Assessing safety risks in role-play fine-tuning of LLMs
Proposing SaRFT to balance role-playing and safety
Mitigating role-specific safety risks in LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Safety-Aware Role-Play Fine-Tuning (SaRFT)
RoleBench for comprehensive risk assessment
LoRA and full-parameter fine-tuning settings
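For context on the LoRA setting, a minimal adapter configuration for one of the listed base models might look like the sketch below, using the `peft` library. The rank, scaling factor, dropout, and target modules are illustrative assumptions, not hyperparameters taken from the paper:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Hypothetical LoRA setup for role-play fine-tuning; all
# hyperparameters below are assumed, not from the paper.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct"
)
lora_config = LoraConfig(
    r=16,                                  # adapter rank (assumed)
    lora_alpha=32,                         # scaling factor (assumed)
    target_modules=["q_proj", "v_proj"],   # attention projections (assumed)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only adapter weights are trainable
```

Under full-parameter fine-tuning, the same training loop would instead update all model weights, which is why safety degradation and its mitigation are evaluated separately under both settings.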