HumorReject: Decoupling LLM Safety from Refusal Prefix via A Little Humor

📅 2025-01-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) commonly rely on explicit refusal prefixes for safety alignment, rendering them vulnerable to prefix injection attacks and prone to over-defensive behaviors that impair interaction naturalness and task effectiveness. To address this, we propose humor as an implicit refusal strategy—introducing, for the first time, a safety-decoupled refusal paradigm modeled through humor and redefining LLM safety at the data level. Our approach comprises three core components: (1) data-driven generation of humorous refusal responses, (2) context-aware humor adaptation to preserve relevance and tone, and (3) a robustness evaluation framework specifically designed to assess resilience against adversarial attacks. Experimental results demonstrate substantial improvements in robustness against prefix injection and related threats, significant mitigation of over-defensiveness, and maintenance of high task completion rates—while simultaneously enabling more natural, engaging, and trustworthy human–LLM interactions.

Technology Category

Application Category

📝 Abstract
Large Language Models (LLMs) commonly rely on explicit refusal prefixes for safety, making them vulnerable to prefix injection attacks. We introduce HumorReject, a novel data-driven approach that fundamentally reimagines LLM safety by decoupling it from refusal prefixes through the use of humor as an indirect refusal strategy. Rather than explicitly rejecting harmful instructions, HumorReject responds with contextually appropriate humor that naturally defuses potentially dangerous requests while maintaining engaging interactions. Our approach effectively addresses the common"over-defense"issues in existing safety mechanisms, demonstrating superior robustness against various attack vectors while preserving natural and high-quality interactions on legitimate tasks. Our findings suggest that innovations at the data level are even more fundamental than the alignment algorithm itself in achieving effective LLM safety, opening new directions for developing more resilient and user-friendly AI systems.
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
prefix injection attacks
over-defense issues
Innovation

Methods, ideas, or system contributions that make the work stand out.

HumorReject
Data-Driven Safety
Large Language Models
🔎 Similar Papers
No similar papers found.
Zihui Wu
Zihui Wu
PhD student, California Institute of Technology
Computational imaging
H
Haichang Gao
School of Computer Science and Technology, Xidian University
J
Jiacheng Luo
School of Computer Science and Technology, Xidian University
Zhaoxiang Liu
Zhaoxiang Liu
China Unicom
Computer VisionDeep LearningRoboticsHuman-Computer Interaction