Tell Me What You Don't Know: Enhancing Refusal Capabilities of Role-Playing Agents via Representation Space Analysis and Editing

📅 2024-09-25
🏛️ arXiv.org
📈 Citations: 1
Influential: 1
🤖 AI Summary
Role-playing agents (RPAs) frequently exhibit insufficient or excessive refusal when confronted with queries conflicting with their assigned role knowledge. To address this, we construct a multi-type conflict query evaluation benchmark and, for the first time, identify separable refusal/answer regions in the forward representation space of RPAs. Building on this finding, we propose a lightweight, gradient-guided low-rank representation editing method—inspired by LoRA—that explicitly steers conflict queries into the refusal region. Our approach significantly improves refusal accuracy (+28.6%) without compromising role consistency, while maintaining stable performance on non-conflict tasks (±0.3% fluctuation). The core contributions are: (1) uncovering an interpretable, structurally separable refusal subspace in the latent representation space; and (2) establishing a novel paradigm for refusal enhancement grounded in representation editing—shifting refusal behavior from output-level heuristics to controllable, geometric manipulation of internal representations.

📝 Abstract
Role-Playing Agents (RPAs) have shown remarkable performance in various applications, yet they often struggle to recognize and appropriately respond to hard queries that conflict with their role-play knowledge. To investigate RPAs' performance when faced with different types of conflicting requests, we develop an evaluation benchmark that includes requests conflicting with contextual knowledge, requests conflicting with parametric knowledge, and non-conflicting requests, in order to assess RPAs' ability to identify conflicts and refuse to answer appropriately without over-refusing. Through extensive evaluation, we find that most RPAs exhibit significant performance gaps across different types of conflicting requests. To elucidate the reasons, we conduct an in-depth representation-level analysis of RPAs under various conflict scenarios. Our findings reveal the existence of rejection regions and direct response regions within the model's forward representations, which influence the RPA's final response behavior. Therefore, we introduce a lightweight representation editing approach that conveniently shifts conflicting requests to the rejection region, thereby enhancing the model's refusal accuracy. The experimental results validate the effectiveness of our editing method, improving RPAs' ability to refuse conflicting requests while maintaining their general role-playing capabilities.
Problem

Research questions and friction points this paper is trying to address.

Enhancing refusal capabilities of Role-Playing Agents for conflicting queries
Analyzing representation space to identify rejection and response regions
Improving refusal accuracy via lightweight representation editing
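
The evaluation goal stated in the abstract (refuse conflicting requests without over-refusing non-conflicting ones) can be illustrated with a minimal scoring sketch. This is not the paper's benchmark code: the `Outcome` record, the `refusal_metrics` function, and the metric names are hypothetical, and the demo data is made up purely to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    """One evaluated request (hypothetical record, not from the paper)."""
    is_conflict: bool  # True if the request conflicts with the role's knowledge
    refused: bool      # True if the agent declined to answer


def refusal_metrics(outcomes):
    """Return the refusal rate on conflicting requests and the
    over-refusal rate on non-conflicting requests."""
    conflict = [o for o in outcomes if o.is_conflict]
    benign = [o for o in outcomes if not o.is_conflict]
    # Guard against empty groups so the sketch never divides by zero.
    conflict_rate = sum(o.refused for o in conflict) / len(conflict) if conflict else 0.0
    over_refusal = sum(o.refused for o in benign) / len(benign) if benign else 0.0
    return {
        "conflict_refusal_rate": conflict_rate,
        "over_refusal_rate": over_refusal,
    }


# Illustrative data only: 2 conflicting requests, 4 non-conflicting ones.
demo = [
    Outcome(is_conflict=True, refused=True),
    Outcome(is_conflict=True, refused=False),
    Outcome(is_conflict=False, refused=False),
    Outcome(is_conflict=False, refused=True),
    Outcome(is_conflict=False, refused=False),
    Outcome(is_conflict=False, refused=False),
]
metrics = refusal_metrics(demo)
print(metrics)
```

Reporting the two rates separately keeps the trade-off visible: an agent that refuses everything would score perfectly on the first number and poorly on the second.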
Innovation

Methods, ideas, or system contributions that make the work stand out.

Representation space analysis for refusal regions
Lightweight representation editing approach
Enhancing refusal accuracy without over-refusing
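
The idea of shifting a request's representation toward a rejection region can be sketched with a simple difference-of-means steering step. This is an illustrative stand-in, not the editing method from the paper: the helper names, the two-dimensional toy vectors, and the `alpha` step size are all assumptions, and a real system would operate on hidden states taken from a model layer.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def squared_distance(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def nearest_region(hidden, refusal_center, answer_center):
    """Label a representation by whichever region center is closer."""
    to_refusal = squared_distance(hidden, refusal_center)
    to_answer = squared_distance(hidden, answer_center)
    return "refuse" if to_refusal < to_answer else "answer"


def steer(hidden, refusal_center, answer_center, alpha=0.5):
    """Move `hidden` along the answer-to-refusal direction by `alpha`."""
    direction = [r - a for r, a in zip(refusal_center, answer_center)]
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy 2-D "hidden states" (made up for illustration).
refused_states = [[2.0, 0.0], [2.0, 2.0]]
answered_states = [[-2.0, 0.0], [-2.0, 2.0]]
refusal_center = mean_vector(refused_states)   # [2.0, 1.0]
answer_center = mean_vector(answered_states)   # [-2.0, 1.0]

# A conflicting request whose representation sits in the answer region.
query_state = [-1.0, 1.0]
before = nearest_region(query_state, refusal_center, answer_center)
edited = steer(query_state, refusal_center, answer_center, alpha=0.5)
after = nearest_region(edited, refusal_center, answer_center)
print(before, edited, after)
```

In this toy setup the edit moves the query from the answer side to the refusal side of the midpoint between the two centers. In practice the step size would need tuning so that non-conflicting requests are not pushed into refusal as well.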
Wenhao Liu
School of Computer Science, Fudan University
Siyu An
YouTu Lab, Tencent
Junru Lu
University of Warwick
natural language processing, question answering
Muling Wu
Fudan University
Tianlong Li
School of Computer Science, Fudan University
Xiaohua Wang
School of Computer Science, Fudan University
Xiaoqing Zheng
Fudan University
Natural Language Processing and Machine Learning
Di Yin
Tencent
LLM, NLP, MLLM
Xing Sun
Tencent Youtu Lab
LLM, MLLM, Agent
Xuanjing Huang
School of Computer Science, Fudan University