Differentially Private Steering for Large Language Model Alignment

📅 2025-01-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address privacy leakage when aligning large language models (LLMs) on private data (e.g., for suppressing hallucinations), this paper proposes PSA, the first differential privacy (DP)-enabled alignment framework based on activation editing. Methodologically, PSA integrates DP mechanisms directly into inference-time editing of hidden-layer activations, using positive and negative demonstrations to preserve desired information and suppress undesired information, thereby calibrating model behavior under privacy constraints. The authors also design the first membership inference attack (MIA) that operates solely on generated text, enabling quantitative auditing of the privacy risks inherent in activation editing. Evaluated across seven benchmarks on open-source models (0.5B–7B parameters, including Llama, Qwen, Mistral, and Gemma), PSA provides ε-DP guarantees while preserving alignment performance, generation quality, and reasoning capability, unifying DP compliance with activation-editing-based alignment for the first time.

📝 Abstract
Aligning Large Language Models (LLMs) with human values and away from undesirable behaviors (such as hallucination) has become increasingly important. Recently, steering LLMs towards a desired behavior via activation editing has emerged as an effective method to mitigate harmful generations at inference-time. Activation editing modifies LLM representations by preserving information from positive demonstrations (e.g., truthful) and minimising information from negative demonstrations (e.g., hallucinations). When these demonstrations come from a private dataset, the aligned LLM may leak private information contained in those private samples. In this work, we present the first study of aligning LLM behavior with private datasets. Our work proposes the Private Steering for LLM Alignment (PSA) algorithm to edit LLM activations with differential privacy (DP) guarantees. We conduct extensive experiments on seven different benchmarks with open-source LLMs of different sizes (0.5B to 7B) and model families (Llama, Qwen, Mistral and Gemma). Our results show that PSA achieves DP guarantees for LLM alignment with minimal loss in performance, including alignment metrics, open-ended text generation quality, and general-purpose reasoning. We also develop the first Membership Inference Attack (MIA) for evaluating and auditing the empirical privacy for the problem of LLM steering via activation editing. Our attack is tailored for activation editing and relies solely on the generated texts without their associated probabilities. Our experiments support the theoretical guarantees by showing improved guarantees for our PSA algorithm compared to several existing non-private techniques.
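The abstract describes steering via activation editing: a steering vector is derived from the difference between activations of positive and negative demonstrations, and privacy is obtained by adding DP noise. The paper does not spell out its exact mechanism here, so the sketch below is a hypothetical illustration of the general recipe (not the authors' implementation): per-pair activation differences are L2-clipped to bound each private sample's influence, averaged, and perturbed with Gaussian noise calibrated via the standard Gaussian mechanism. All function and parameter names are assumptions for illustration.

```python
import numpy as np

def dp_steering_vector(pos_acts, neg_acts, clip_norm=1.0,
                       epsilon=1.0, delta=1e-5, seed=0):
    """Hypothetical sketch of a DP steering vector (not the paper's exact PSA).

    pos_acts / neg_acts: (n, d) arrays; each row is one demonstration's
    hidden activation at a chosen layer.
    """
    # One difference vector per (positive, negative) demonstration pair.
    diffs = pos_acts - neg_acts

    # Clip each pair's difference to L2 norm <= clip_norm, so replacing one
    # pair changes the mean by at most 2 * clip_norm / n in L2 norm.
    norms = np.linalg.norm(diffs, axis=1, keepdims=True)
    diffs = diffs * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))

    n, d = diffs.shape
    sensitivity = 2.0 * clip_norm / n

    # Gaussian mechanism: sigma = sensitivity * sqrt(2 ln(1.25/delta)) / epsilon
    # gives (epsilon, delta)-DP for the averaged vector.
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    rng = np.random.default_rng(seed)
    return diffs.mean(axis=0) + rng.normal(0.0, sigma, size=d)
```

At inference time, the private vector would be added (scaled by a steering strength) to the hidden states of the chosen layer; by DP post-processing, any text generated from the steered model inherits the same (ε, δ) guarantee.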
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Privacy Preservation
Training Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Privacy-preserving Adjustment
Large Language Models
PSA Methodology