Sanitize Your Responses: Mitigating Privacy Leakage in Large Language Models

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing post-hoc privacy filtering methods suffer from high latency, substantial computational overhead, and incompatibility with streaming LLM generation. To address these limitations, this paper proposes Self-Sanitize, a novel framework that, for the first time, incorporates cognitive psychology's self-monitoring and self-repair mechanisms into LLM safety, enabling real-time, token-level detection and in-place correction of privacy leaks at low latency. The framework comprises two lightweight, streaming-compatible modules: Self-Monitor, which identifies privacy-leaking intent via representation engineering, and Self-Repair, which corrects sensitive content on the fly without requiring additional dialogue turns. Evaluations across four mainstream LLMs and three privacy-leakage scenarios show that Self-Sanitize substantially reduces leakage rates while adding negligible average latency (<1 ms) and preserving generation quality and utility.
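A minimal illustration of the representation-engineering idea behind Self-Monitor: a lightweight linear probe reads a hidden-state vector at each generated token and scores whether the current intent is privacy-leaking. The 3-d vectors, the perceptron-style training loop, and the `monitor` function below are toy assumptions for the sketch, not the paper's actual implementation.

```python
# Toy sketch (assumed, not the paper's code): a linear probe over
# per-token hidden states flags privacy-leaking intent.

def dot(w, h):
    return sum(wi * hi for wi, hi in zip(w, h))

def train_probe(examples, dim, epochs=20, lr=0.1):
    """Perceptron-style training on (hidden_state, is_leak) pairs."""
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for h, y in examples:  # y: 1 = privacy-leaking intent
            pred = 1 if dot(w, h) + b > 0 else 0
            err = y - pred
            if err:
                w = [wi + lr * err * hi for wi, hi in zip(w, h)]
                b += lr * err
    return w, b

def monitor(w, b, h):
    """Flag a token's hidden state as leaking (True) or safe (False)."""
    return dot(w, h) + b > 0

# Synthetic 3-d "hidden states": leaking intents cluster near (1, 1, 0).
data = [([1.0, 0.9, 0.1], 1), ([0.9, 1.1, 0.0], 1),
        ([0.0, 0.1, 1.0], 0), ([0.1, 0.0, 0.9], 0)]
w, b = train_probe(data, dim=3)
print(monitor(w, b, [1.0, 1.0, 0.0]))  # → True  (leaking-like state)
print(monitor(w, b, [0.0, 0.0, 1.0]))  # → False (safe-like state)
```

Because the probe is a single dot product per token, it is cheap enough to run inside the decoding loop, which is what makes the claimed sub-millisecond average latency plausible.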

📝 Abstract
As Large Language Models (LLMs) achieve remarkable success across a wide range of applications, such as chatbots and code copilots, concerns surrounding the generation of harmful content have come increasingly into focus. Despite significant advances in aligning LLMs with safety and ethical standards, adversarial prompts can still be crafted to elicit undesirable responses. Existing mitigation strategies are predominantly based on post-hoc filtering, which introduces substantial latency or computational overhead and is incompatible with token-level streaming generation. In this work, we introduce Self-Sanitize, a novel LLM-driven mitigation framework inspired by cognitive psychology, which emulates human self-monitoring and self-repair behaviors during conversations. Self-Sanitize comprises a lightweight Self-Monitor module that continuously inspects high-level intentions within the LLM at the token level via representation engineering, and a Self-Repair module that performs in-place correction of harmful content without initiating separate review dialogues. This design allows for real-time streaming monitoring and seamless repair, with negligible impact on latency and resource utilization. Given that privacy-invasive content has often received insufficient attention in previous studies, we perform extensive experiments on four LLMs across three privacy leakage scenarios. The results demonstrate that Self-Sanitize achieves superior mitigation performance with minimal overhead and without degrading the utility of LLMs, offering a practical and robust solution for safer LLM deployments. Our code is available at: https://github.com/wjfu99/LLM_Self_Sanitize
Problem

Research questions and friction points this paper is trying to address.

Mitigating privacy leakage in Large Language Models from adversarial prompts
Reducing latency and overhead of existing post-hoc filtering methods
Enabling real-time monitoring and correction of harmful content generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-Sanitize framework mimics human self-monitoring behaviors
Lightweight monitor inspects token-level intentions via representation engineering
In-place correction module repairs harmful content without separate dialogues
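The innovations above can be tied together in a streaming sketch: tokens are emitted one at a time, a monitor checks each token, and a repair step rewrites flagged content in place within the same stream, with no separate review dialogue. The `SENSITIVE` set, the string-matching `monitor`, and the `[REDACTED]` substitution are illustrative stand-ins (the paper monitors internal representations, not surface strings).

```python
# Illustrative sketch (assumed, not the paper's code) of a streaming
# monitor-and-repair loop in the spirit of Self-Monitor + Self-Repair.

SENSITIVE = {"555-0142", "alice@example.com"}  # hypothetical leak tokens

def monitor(token):
    """Stand-in for the representation-level probe: flag leaking tokens."""
    return token in SENSITIVE

def repair(token):
    """In-place correction: replace the flagged token with a redaction."""
    return "[REDACTED]"

def sanitize_stream(tokens):
    """Yield a sanitized token stream, repairing flagged tokens in place."""
    for tok in tokens:
        yield repair(tok) if monitor(tok) else tok

draft = ["Her", "number", "is", "555-0142", "."]
print(" ".join(sanitize_stream(draft)))
# → Her number is [REDACTED] .
```

Because the repair happens inside the generator, downstream consumers see a single uninterrupted stream, matching the paper's claim of streaming compatibility without additional dialogue turns.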