🤖 AI Summary
Large language models (LLMs) generate homogeneous, context-agnostic responses to vulnerable users in high-stakes scenarios, posing personalized safety risks; yet existing safety evaluations rely on context-free metrics and ignore how a user's background modulates risk. Method: We introduce the "personalized safety" paradigm, proposing PENGUIN, a benchmark of 14,000 diverse scenarios spanning seven sensitive domains, and RAISE, a lightweight, fine-tuning-free, two-stage planning agent that dynamically models user background via selective information acquisition within an average of 2.7 interaction rounds. RAISE combines a planning-driven architecture, context-sensitive risk modeling, and multi-dimensional user-attribute filtering, and is evaluated via scenario-based adversarial testing. Contribution/Results: Incorporating personalized user information raises the average safety scores of mainstream LLMs by 43.2%; RAISE further improves performance by up to 31.6%, demonstrating that minimal, adaptive personalization is an effective and practical route to safety enhancement.
📝 Abstract
Large language models (LLMs) typically generate identical or similar responses for all users given the same prompt, posing serious safety risks in high-stakes applications where user vulnerabilities differ widely. Existing safety evaluations primarily rely on context-independent metrics (such as factuality, bias, or toxicity), overlooking the fact that the same response may carry divergent risks depending on the user's background or condition. We introduce personalized safety to fill this gap and present PENGUIN, a benchmark comprising 14,000 scenarios across seven sensitive domains, each with both context-rich and context-free variants. Evaluating six leading LLMs, we find that providing personalized user information improves safety scores by 43.2%, confirming the effectiveness of personalization in safety alignment. However, not all context attributes contribute equally to safety. To address this, we develop RAISE, a training-free, two-stage agent framework that strategically acquires user-specific background information. RAISE improves safety scores by up to 31.6% over six vanilla LLMs while keeping interaction cost low, at just 2.7 user queries on average. Our findings highlight the importance of selective information gathering in safety-critical domains and offer a practical way to personalize LLM responses without model retraining. This work lays a foundation for safety research that adapts to individual user contexts rather than assuming a universal harm standard.
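The two-stage design described above (first plan which user attributes to ask about under a small interaction budget, then condition the response on the answers) can be illustrated with a minimal sketch. This is not the paper's implementation: the attribute names, relevance scores, budget, and helper functions below are all hypothetical stand-ins for what a real RAISE-style agent would compute with an LLM.

```python
# Illustrative two-stage acquisition agent, in the spirit of RAISE.
# Stage 1 selects high-relevance user attributes under a query budget;
# Stage 2 generates a response conditioned on the acquired background.
# All attributes, scores, and helpers are hypothetical examples.

ATTRIBUTES = {  # hypothetical attribute -> assumed safety relevance
    "age": 0.9,
    "mental_health_history": 0.8,
    "medication": 0.6,
    "occupation": 0.2,
    "hobbies": 0.1,
}

def plan_queries(attributes, budget):
    """Stage 1: rank attributes by assumed relevance and keep only the
    top `budget` (the paper reports ~2.7 interaction rounds on average;
    a fixed toy budget is used here)."""
    ranked = sorted(attributes, key=attributes.get, reverse=True)
    return ranked[:budget]

def respond(scenario, answers):
    """Stage 2: stand-in for the LLM call that would condition the
    reply on the acquired user background."""
    context = "; ".join(f"{k}={v}" for k, v in answers.items())
    return f"[safety-aware reply to '{scenario}' given: {context}]"

def raise_sketch(scenario, ask_user, budget=3):
    """Run both stages: plan the queries, collect answers, respond."""
    queries = plan_queries(ATTRIBUTES, budget)
    answers = {q: ask_user(q) for q in queries}
    return respond(scenario, answers)

# Toy usage with canned answers in place of a real user dialogue.
canned = {"age": "17", "mental_health_history": "anxiety", "medication": "none"}
reply = raise_sketch("I can't sleep and feel hopeless", canned.get)
print(reply)
```

The point of the sketch is the budgeted selectivity: rather than asking for the full user profile, the agent acquires only the few attributes expected to change the risk assessment, which is what keeps the interaction cost low.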