🤖 AI Summary
Large language models (LLMs) commonly rely on explicit refusal prefixes for safety alignment, leaving them vulnerable to prefix injection attacks and prone to over-defensive behavior that impairs the naturalness and effectiveness of interactions. To address this, we propose humor as an implicit refusal strategy, introducing for the first time a refusal paradigm that decouples safety from explicit refusal prefixes and redefines LLM safety at the data level. Our approach comprises three core components: (1) data-driven generation of humorous refusal responses, (2) context-aware humor adaptation to preserve relevance and tone, and (3) a robustness evaluation framework designed to assess resilience against adversarial attacks. Experimental results demonstrate substantial improvements in robustness against prefix injection and related threats, significant mitigation of over-defensiveness, and high task completion rates, while enabling more natural, engaging, and trustworthy human–LLM interactions.
📝 Abstract
Large Language Models (LLMs) commonly rely on explicit refusal prefixes for safety, making them vulnerable to prefix injection attacks. We introduce HumorReject, a novel data-driven approach that fundamentally reimagines LLM safety by decoupling it from refusal prefixes through the use of humor as an indirect refusal strategy. Rather than explicitly rejecting harmful instructions, HumorReject responds with contextually appropriate humor that naturally defuses potentially dangerous requests while maintaining engaging interactions. Our approach effectively addresses the common "over-defense" issues in existing safety mechanisms, demonstrating superior robustness against various attack vectors while preserving natural and high-quality interactions on legitimate tasks. Our findings suggest that innovations at the data level are even more fundamental than the alignment algorithm itself in achieving effective LLM safety, opening new directions for developing more resilient and user-friendly AI systems.