Exploring Safety-Utility Trade-Offs in Personalized Language Models

📅 2024-06-17
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
This work identifies a significant "safety–utility trade-off bias" in large language models (LLMs) under user-identity personalization: across demographic identity groups, both the model's robustness to harmful prompts and its multi-task capability (knowledge recall, mathematical reasoning, programming, and logical inference) shift with the user's identity, and the two dimensions often trade off against each other. We formally define and quantify this identity-dependent bias along both the safety and utility axes, empirically validating its prevalence and inconsistency across Llama, Mistral, GPT-3.5, and GPT-4o. To mitigate the bias, we propose a hybrid approach that combines preference optimization with defensive prompt engineering, and we construct a fine-grained, multidimensional evaluation benchmark measuring safety response rate and task-specific accuracy. Experimental results show that our method substantially improves cross-group fairness and stability without compromising overall performance.

📝 Abstract
As large language models (LLMs) become increasingly integrated into daily applications, it is essential to ensure they operate fairly across diverse user demographics. In this work, we show that LLMs suffer from personalization bias, where their performance is impacted when they are personalized to a user's identity. We quantify personalization bias by evaluating the performance of LLMs along two axes: safety and utility. We measure safety by examining how benign LLM responses are to unsafe prompts with and without personalization. We measure utility by evaluating the LLM's performance on various tasks, including general knowledge, mathematical abilities, programming, and reasoning skills. We find that various LLMs, ranging from open-source models like Llama (Touvron et al., 2023) and Mistral (Jiang et al., 2023) to API-based ones like GPT-3.5 and GPT-4o (Ouyang et al., 2022), exhibit significant variance in performance in terms of safety-utility trade-offs depending on the user's identity. Finally, we discuss several strategies to mitigate personalization bias using preference tuning and prompt-based defenses.
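The two-axis evaluation described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's benchmark: the refusal markers, score names, and the spread-based bias measure are all assumptions made for the example.

```python
# Sketch of the safety/utility measurement: score a model's responses per
# identity persona, then compare each persona's score to the no-persona
# baseline. The helper names and refusal heuristics are illustrative.

def safety_rate(responses):
    """Fraction of responses to unsafe prompts that refuse or deflect.
    A crude keyword heuristic stands in for a real safety classifier."""
    refusal_markers = ("i can't", "i cannot", "i won't", "sorry")
    return sum(r.lower().startswith(refusal_markers) for r in responses) / len(responses)

def utility_rate(answers, gold):
    """Fraction of task answers (knowledge, math, code, reasoning)
    matching the reference answers."""
    return sum(a == g for a, g in zip(answers, gold)) / len(gold)

def personalization_bias(per_identity_scores, baseline):
    """Per-identity delta from the non-personalized baseline, plus the
    max-min spread across identities as a simple bias summary."""
    deltas = {i: s - baseline for i, s in per_identity_scores.items()}
    spread = max(per_identity_scores.values()) - min(per_identity_scores.values())
    return deltas, spread
```

Running the same unsafe-prompt and task suites once without a persona (the baseline) and once per identity, then feeding the scores into `personalization_bias`, yields both the identity-specific degradation and a single spread number for cross-group comparison.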
Problem

Research questions and friction points this paper is trying to address.

Safety-utility trade-offs in LLMs
Personalization bias in language models
Mitigating bias with preference tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Quantifying personalization bias
Measuring safety and utility
Mitigating with preference tuning
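The prompt-based defense in the mitigation bullets can be sketched as below. The reminder wording and function name are assumptions for illustration, not the paper's actual defensive prompt.

```python
# Sketch of a prompt-based defense: append a fixed safety reminder after
# the persona, so personalization cannot displace the safety instruction.

SAFETY_REMINDER = (
    "Regardless of the user's identity, refuse unsafe requests and "
    "answer benign questions to the best of your ability."
)

def build_system_prompt(identity=None):
    """Compose a system prompt: optional persona first, safety reminder last."""
    parts = []
    if identity:
        parts.append(f"You are talking to {identity}.")
    parts.append(SAFETY_REMINDER)
    return " ".join(parts)
```

Placing the reminder after the persona reflects the common observation that later instructions tend to carry more weight; the paper pairs such prompt defenses with preference tuning rather than relying on either alone.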