Differentially Private Steering for Large Language Model Alignment

📅 2025-01-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address privacy leakage when aligning large language models (LLMs) on private data (e.g., for suppressing hallucinations), this paper proposes PSA, the first differential privacy (DP)-enabled alignment framework based on activation editing. Methodologically, PSA integrates DP mechanisms directly into inference-time editing of hidden-layer activations, using positive and negative demonstrations to preserve desired information and suppress undesired information, thereby calibrating model behavior under privacy constraints. The authors also design the first membership inference attack (MIA) that operates solely on generated text, enabling quantitative auditing of the privacy risks inherent in activation editing. Evaluated across seven benchmarks on open-source models (0.5B–7B parameters, including Llama, Qwen, Mistral, and Gemma), PSA provides ε-DP guarantees while preserving alignment performance, generation quality, and reasoning capability, unifying DP compliance with activation-editing-based alignment for the first time.

📝 Abstract
Aligning Large Language Models (LLMs) with human values and away from undesirable behaviors (such as hallucination) has become increasingly important. Recently, steering LLMs towards a desired behavior via activation editing has emerged as an effective method to mitigate harmful generations at inference-time. Activation editing modifies LLM representations by preserving information from positive demonstrations (e.g., truthful) and minimising information from negative demonstrations (e.g., hallucinations). When these demonstrations come from a private dataset, the aligned LLM may leak private information contained in those private samples. In this work, we present the first study of aligning LLM behavior with private datasets. Our work proposes the Private Steering for LLM Alignment (PSA) algorithm to edit LLM activations with differential privacy (DP) guarantees. We conduct extensive experiments on seven different benchmarks with open-source LLMs of different sizes (0.5B to 7B) and model families (Llama, Qwen, Mistral and Gemma). Our results show that PSA achieves DP guarantees for LLM alignment with minimal loss in performance, including alignment metrics, open-ended text generation quality, and general-purpose reasoning. We also develop the first Membership Inference Attack (MIA) for evaluating and auditing the empirical privacy for the problem of LLM steering via activation editing. Our attack is tailored for activation editing and relies solely on the generated texts without their associated probabilities. Our experiments support the theoretical guarantees by showing improved guarantees for our PSA algorithm compared to several existing non-private techniques.
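The abstract describes steering via activation editing: a steering vector is derived from the difference between activations of positive and negative demonstrations, and privacy is obtained by adding DP noise. The paper does not spell out its exact mechanism here, so the sketch below is a hypothetical illustration of the general recipe (not the authors' implementation): per-pair activation differences are L2-clipped to bound each private sample's influence, averaged, and perturbed with Gaussian noise calibrated via the standard Gaussian mechanism. All function and parameter names are assumptions for illustration.

```python
import numpy as np

def dp_steering_vector(pos_acts, neg_acts, clip_norm=1.0,
                       epsilon=1.0, delta=1e-5, seed=0):
    """Hypothetical sketch of a DP steering vector (not the paper's exact PSA).

    pos_acts / neg_acts: (n, d) arrays; each row is one demonstration's
    hidden activation at a chosen layer.
    """
    # One difference vector per (positive, negative) demonstration pair.
    diffs = pos_acts - neg_acts

    # Clip each pair's difference to L2 norm <= clip_norm, so replacing one
    # pair changes the mean by at most 2 * clip_norm / n in L2 norm.
    norms = np.linalg.norm(diffs, axis=1, keepdims=True)
    diffs = diffs * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))

    n, d = diffs.shape
    sensitivity = 2.0 * clip_norm / n

    # Gaussian mechanism: sigma = sensitivity * sqrt(2 ln(1.25/delta)) / epsilon
    # gives (epsilon, delta)-DP for the averaged vector.
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    rng = np.random.default_rng(seed)
    return diffs.mean(axis=0) + rng.normal(0.0, sigma, size=d)
```

At inference time, the private vector would be added (scaled by a steering strength) to the hidden states of the chosen layer; by DP post-processing, any text generated from the steered model inherits the same (ε, δ) guarantee.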
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Privacy Preservation
Training Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Privacy-preserving Adjustment
Large Language Models
PSA Methodology