Let Them Down Easy! Contextual Effects of LLM Guardrails on User Perceptions and Preferences

📅 2025-05-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Existing large language models (LLMs) refuse potentially harmful queries rigidly, without regard to the diversity of user intent, which forces a trade-off between safety and user experience. Method: We propose a “refusal-as-interaction” paradigm and validate it through a large-scale human evaluation (480 participants, 3,840 query-response pairs), a comparative analysis of responses from nine state-of-the-art LLMs, and a benchmark of six reward models. Contribution/Results: We identify “partial compliance” (providing general, non-actionable information while omitting executable details) as the best balance between safety and utility. This strategy reduces negative user perceptions by over 50% compared to flat-out refusals. Response strategy largely shapes user experience, whereas actual user motivation has negligible impact, which suggests that crafting thoughtful refusals matters more than detecting intent. Current LLMs rarely adopt partial compliance on their own, and the reward models evaluated undervalue it.

📝 Abstract

Current LLMs are trained to refuse potentially harmful input queries regardless of whether users actually have harmful intent, causing a tradeoff between safety and user experience. Through a study of 480 participants evaluating 3,840 query-response pairs, we examine how different refusal strategies affect user perceptions across varying motivations. Our findings reveal that response strategy largely shapes user experience, while actual user motivation has negligible impact. Partial compliance -- providing general information without actionable details -- emerges as the optimal strategy, reducing negative user perceptions by over 50% compared to flat-out refusals. Complementing this, we analyze response patterns of 9 state-of-the-art LLMs and evaluate how 6 reward models score different refusal strategies, demonstrating that models rarely deploy partial compliance naturally and reward models currently undervalue it. This work demonstrates that effective guardrails require focusing on crafting thoughtful refusals rather than detecting intent, offering a path toward AI safety mechanisms that ensure both safety and sustained user engagement.

Problem

Research questions and friction points this paper is trying to address.

LLMs refuse potentially harmful queries regardless of actual user intent, harming user experience

Study compares refusal strategies' impact on perceptions

Partial compliance reduces negative perceptions by over 50% compared to flat-out refusals

Innovation

Methods, ideas, or system contributions that make the work stand out.

Partial compliance reduces negative user perceptions

Response strategy shapes user experience significantly

Reward models undervalue partial compliance strategy
